IQR is one of the most important topics in descriptive statistics, and this article discusses it at length. If you read my previous article about handling outliers, you will have come across a short introduction to the IQR. The interquartile range (IQR) is simply the difference between the upper and lower quartiles. Quartiles are the three values that divide an ordered data set into four equal parts.
IQR is also known as:
It acts as a measure of variability for skewed distributions and is based on dividing an ordered data set into four equal parts using three quartiles. Q1, Q2, and Q3 are the first, second, and third quartiles. Extreme values have little or no effect on the IQR, and this resistance is one of the main reasons it is often preferred as a measure of dispersion.
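It is calculated by subtracting Q1, which is the lower quartile, from Q3, which is the upper quartile:
IQR = Q3 - Q1
where:
Q1 is the first (lower) quartile, the value below which 25% of the data falls
Q3 is the third (upper) quartile, the value below which 75% of the data falls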
Let’s say, for instance, that you have a data set containing the points 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20. From these figures, I find the first quartile to be 6.5, the second quartile to be 11, and the third quartile to be 15.5 (using linear interpolation between neighboring points, as Excel's QUARTILE function does; other quartile conventions can give slightly different values). To verify this, if you draw the points on a number line, you will see that these three numbers divide the line from 2 to 20 into four quarters. Now when I subtract the first quartile from the third one (15.5 – 6.5), I get 9, which is exactly what we call the interquartile range.
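To make the arithmetic above easy to check, here is a minimal Python sketch using the standard library. It assumes the "inclusive" quartile convention (linear interpolation with the smallest and largest points treated as the 0th and 100th percentiles), which is the convention that reproduces 6.5, 11, and 15.5 for this data; the variable names are my own.

```python
from statistics import quantiles

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# method="inclusive" interpolates linearly between data points and treats
# the minimum and maximum as the 0th and 100th percentiles.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 6.5 11.0 15.5
print(iqr)         # 9.0
```

Switching to a different quartile convention would likely change Q1 and Q3 a little, so it is worth stating which one you used when reporting an IQR.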
The IQR can be used in several ways which include:
It is also used to measure how spread out the data points in a dataset are around the median: it is the width of the middle 50% of the data. If the data points are widely spread out, the IQR will be larger. Similarly, if the data points are tightly clustered, the IQR will be smaller.
If we want to test whether or not a specific population has a normal distribution, we can use the interquartile range formula along with the mean and standard deviation.
The following is the formula for testing a population:
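Q1 = (σ × z1) + X
Q3 = (σ × z3) + X
where:
X is the mean of the population
σ is the standard deviation of the population
z1 and z3 are the standard normal scores (z-scores) of the first and third quartiles, approximately -0.67 and +0.67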
Once you have solved both of those equations, compare the results with the actual first and third quartiles of the data. If you see a large difference between the calculated and the actual values, the population is probably not normally distributed. Similarly, if there is not much difference for either the first or the third quartile, the data is consistent with a normal distribution.
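As a rough illustration of this check, the sketch below computes the quartiles a normal distribution would have, Q1 = (σ × z1) + X and Q3 = (σ × z3) + X, and returns them next to the actual quartiles. The function name `normal_quartile_gap`, the sample data, and the choice of the population standard deviation and the "inclusive" quartile convention are all my assumptions. How large a gap counts as "large" is a judgment call that this sketch does not settle.

```python
from statistics import NormalDist, mean, pstdev, quantiles

def normal_quartile_gap(values):
    """Return (expected_q1, expected_q3, actual_q1, actual_q3)."""
    x_bar = mean(values)
    sigma = pstdev(values)            # population standard deviation
    z1 = NormalDist().inv_cdf(0.25)   # about -0.67
    z3 = NormalDist().inv_cdf(0.75)   # about +0.67
    expected_q1 = sigma * z1 + x_bar  # quartiles a normal population would have
    expected_q3 = sigma * z3 + x_bar
    actual_q1, _, actual_q3 = quantiles(values, n=4, method="inclusive")
    return expected_q1, expected_q3, actual_q1, actual_q3

sample = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
exp_q1, exp_q3, act_q1, act_q3 = normal_quartile_gap(sample)
print(f"expected Q1 = {exp_q1:.2f}, actual Q1 = {act_q1}")
print(f"expected Q3 = {exp_q3:.2f}, actual Q3 = {act_q3}")
```

For a small sample like this one, the comparison is only suggestive; a formal normality test would be a better fit when the decision matters.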
Given above is an example of a dataset from which we will be calculating the interquartile range using MS Excel.
The following are the steps to calculate the IQR using Excel:
Output:
According to ThoughtCo., “Outliers are individual values that fall outside of the overall pattern of a data set.” In other words, an outlier is a data point that is dramatically different from the other data points. We can identify outliers with the help of the interquartile range. Let’s take a closer look at this with the help of a box plot.
A Box Plot is a graphical representation of a data set’s statistical five-number summary: the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.
A box plot gives a compact picture of the distribution of the data. It shows how much or how little the data is spread out and lets us read off its range, its IQR, and its skewness.
As observed in the image above, Minimum is the lowest value in the dataset whereas Maximum is the highest value in the dataset. The difference between the maximum and the minimum is the dataset’s range.
The second quartile is the median. 25% of the data lies between the minimum and Q1, while 75% of the data lies between the minimum and Q3, so the middle 50% of the data lies between Q1 and Q3.
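The five values a box plot draws can be computed directly. The following is a small sketch, assuming the "inclusive" quartile convention and using the data set 2, 4, ..., 20 as an example; `five_number_summary` is a hypothetical helper name, not a library function.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
summary = five_number_summary(data)
print(summary)  # (2, 6.5, 11.0, 15.5, 20)

# The range and the IQR follow from the summary.
minimum, q1, median, q3, maximum = summary
print(maximum - minimum)  # 18
print(q3 - q1)            # 9.0
```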
By now we understand how to calculate the IQR. However, to use the IQR method to identify outliers, we need to establish a specific range. Any data point that does not fall within this range is considered an outlier.
The formula for identifying an outlier is as follows:
Lower Bound = Q1 - (1.5 × IQR)
Upper Bound = Q3 + (1.5 × IQR)
An outlier is any data that is less than the Lower Bound or greater than the Upper Bound.
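Here is a minimal sketch of the rule, with the Lower Bound taken as Q1 - 1.5 × IQR and the Upper Bound as Q3 + 1.5 × IQR. The helper name `iqr_outliers`, the "inclusive" quartile convention, and the sample data (with an extra value of 60 added on purpose) are assumptions made for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_bound, upper_bound, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Anything strictly outside [lower, upper] is flagged.
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

# Hypothetical data: the even numbers 2..20 plus one extreme value.
data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 60]
lower, upper, outliers = iqr_outliers(data)
print(lower, upper)  # -8.0 32.0
print(outliers)      # [60]
```

The multiplier is exposed as the parameter `k` so that the effect of choosing a value other than 1.5 can be tried on the same data.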
If you’re wondering why we use 1.5 rather than some other number to identify outliers, I wondered the same thing, which led me to research why that is the case.
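Suppose we used a larger scale instead of 1.5: the bounds would be so wide that genuine outliers would be treated as ordinary data points. In the opposite scenario, if we used a smaller scale instead of 1.5, the bounds would be so narrow that ordinary data points would be misclassified as outliers. The scale of 1.5 is a widely used compromise between these two extremes: for roughly normal data it places the bounds at about 2.7 standard deviations from the mean, close to the familiar three-sigma rule. It helps us separate outliers from ordinary data points without confusing the two too often.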
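To see how the scale affects the bounds, here is a minimal sketch that assumes a standard normal population (mean 0, standard deviation 1). For a few scales, it expresses the Upper Bound in standard deviations and computes the share of the population that falls inside both bounds. The function name `fence_coverage` is hypothetical, and real data that is skewed or heavy-tailed will behave differently.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def fence_coverage(k):
    """Return (upper_bound_in_sigmas, share_inside_bounds) for scale k."""
    q3 = std_normal.inv_cdf(0.75)  # about 0.674
    q1 = -q3                       # the normal curve is symmetric
    iqr = q3 - q1
    upper = q3 + k * iqr
    lower = q1 - k * iqr
    inside = std_normal.cdf(upper) - std_normal.cdf(lower)
    return upper, inside

for k in (1.0, 1.5, 2.0):
    upper, inside = fence_coverage(k)
    print(f"scale {k}: bound at {upper:.2f} sigma, {inside:.2%} of data inside")
```

With a scale of 1.5 the bound lands near 2.7 standard deviations, leaving well under 1% of a normal population flagged; a scale of 1.0 flags noticeably more points and a scale of 2.0 flags almost none.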