The IQR is one of the most important topics in descriptive statistics, and this article discusses it at length. If you read my previous article about handling outliers, you will have come across a short introduction to it. The interquartile range (IQR) is the difference between the upper and lower quartiles. Quartiles are the three values that divide a sorted data set into four equal parts.
The IQR is also known as the midspread, the middle 50%, or the H-spread.
It acts as a measure of variability for skewed distributions and is based on splitting a data set into four equal parts using three quartiles: Q1, Q2, and Q3 are the first, second, and third quartiles. Extreme values have little effect on the IQR, and this resistance is one of the main reasons it is often preferred as a measure of dispersion.
It is calculated by subtracting Q1, the lower quartile, from Q3, the upper quartile: IQR = Q3 – Q1.
- Q1 = (1/4)(n + 1)th term
- Q3 = (3/4)(n + 1)th term
- n = the total number of data points
Let’s say, for instance, that you have a data set containing the points 2, 4, 6, 8, 10, 12, 14, 16, 18, and 20. Using the interpolation that Excel’s QUARTILE function applies, the first quartile is 6.5, the second quartile is 11, and the third quartile is 15.5. If you mark these points on a number line, you will see that the three values divide the line from 2 to 20 into four quarters. Subtracting the first quartile from the third (15.5 – 6.5) gives 9, which is the interquartile range. Note that applying the (n + 1) position formula above to the same ten points gives Q1 = 5.5 and Q3 = 16.5, and so an IQR of 11. Quartile conventions differ slightly, and the gap is most visible on small data sets.
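The worked example can be checked with a short Python sketch. It assumes Python 3.8 or later and uses `statistics.quantiles` from the standard library, which is not something the article itself relies on. The "inclusive" method reproduces the quartiles quoted in the example, while the "exclusive" method follows the (n + 1) position formula.

```python
from statistics import quantiles

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]

# "inclusive" interpolates between positions 1 and n of the sorted data;
# it reproduces the quartiles quoted in the example.
q1_inc, q2_inc, q3_inc = quantiles(data, n=4, method="inclusive")
iqr_inc = q3_inc - q1_inc

# "exclusive" places the quartiles at the (n + 1)/4 and 3(n + 1)/4 positions.
q1_exc, q2_exc, q3_exc = quantiles(data, n=4, method="exclusive")
iqr_exc = q3_exc - q1_exc

print(q1_inc, q2_inc, q3_inc, iqr_inc)  # 6.5 11.0 15.5 9.0
print(q1_exc, q2_exc, q3_exc, iqr_exc)  # 5.5 11.0 16.5 11.0
```

Both conventions agree on the median. They differ only in where they place Q1 and Q3, so it is worth stating which one you used when reporting an IQR.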
What is an Interquartile Range Used For?
The IQR can be used in several ways which include:
- Constructing graphical representations of a probability distribution
- Constructing box plots
- Identifying outliers
It is also used to measure how spread out the middle 50% of the data points are around the median. If the data points are more spread out, the IQR will be larger. If they are clustered more tightly, the IQR will be smaller.
Using IQR as a Test for Normal Distribution
If we want to check whether a specific population has a normal distribution, we can compare its quartiles with the values predicted from its mean and standard deviation.
The following is the formula for testing a population:
Q1 = (σ z1) + X
Q3 = (σ z3) + X
- Q1 = First quartile
- Q3 = Third quartile
- σ = Standard deviation
- z1, z3 = Standard scores (z-scores) of the first and third quartiles; for a normal distribution, z1 ≈ –0.67 and z3 ≈ +0.67
- X = Mean
Once you have solved both equations, compare the results with the actual first and third quartiles of the data. If the actual quartiles differ substantially from the calculated values, the population is not normally distributed. If the actual and calculated values are close, the data is consistent with a normal distribution.
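As a rough illustration of this comparison, the sketch below computes the quartiles a normal distribution would have for a given mean and standard deviation and prints them next to the quartiles of the data. The function name `expected_quartiles`, the use of the population standard deviation, and the sample data are assumptions made for this sketch. How large a difference counts as "substantial" is left to judgment.

```python
from statistics import NormalDist, mean, pstdev, quantiles

def expected_quartiles(values):
    """Q1 and Q3 of a normal distribution with the data's mean and SD."""
    mu = mean(values)
    sigma = pstdev(values)  # population SD; an assumption of this sketch
    z1 = NormalDist().inv_cdf(0.25)  # about -0.6745
    z3 = NormalDist().inv_cdf(0.75)  # about +0.6745
    return sigma * z1 + mu, sigma * z3 + mu

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]  # illustrative data only
exp_q1, exp_q3 = expected_quartiles(data)
act_q1, _, act_q3 = quantiles(data, n=4, method="inclusive")

print(f"expected Q1 = {exp_q1:.2f}, actual Q1 = {act_q1:.2f}")
print(f"expected Q3 = {exp_q3:.2f}, actual Q3 = {act_q3:.2f}")
```

This is an informal check rather than a formal test of normality. With very small samples, the actual quartiles can drift from the expected ones even when the underlying population is normal.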
Calculating IQR In MS Excel 2007
Given above is an example of a dataset from which we will be calculating the interquartile range using MS Excel.
The following are the steps to calculate the IQR using Excel:
- Firstly, you need to enter your entire data into a single Excel column. (Just like the example given above)
- Secondly, click any blank cell and type the formula “=QUARTILE(A1:A10, 1)” into it, then press Enter. This will calculate the first quartile for you. Make sure the cell range matches where your data actually is, otherwise the result will be wrong. For example, if your data is between D1 and D20, enter the formula “=QUARTILE(D1:D20, 1)”. The “1” in the formula represents the first quartile.
- Thirdly, repeat the previous step to find the third quartile, but instead of “1,” enter “3” because it represents the third quartile. Therefore, the formula for calculating the 3rd quartile would be “=QUARTILE(A1:A10, 3)”.
- Lastly, to calculate the IQR, we need to subtract Q1 from Q3. Since the value for Q1 is in C2 and the value for Q3 is in C3, the formula is “=C3-C2”.
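For readers who want to see what the spreadsheet steps are doing, here is a minimal Python sketch of the same calculation. It assumes that Excel's QUARTILE interpolates linearly at position 1 + (n – 1) × quart / 4 in the sorted data, which is my understanding of the function rather than something stated in this article. The helper name `quartile_inclusive` is made up, and the ten values from the earlier example stand in for column A.

```python
def quartile_inclusive(values, quart):
    """Interpolated quartile at 1-based position 1 + (n - 1) * quart / 4."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * quart / 4  # 0-based fractional index
    lower = int(position)
    fraction = position - lower
    if lower + 1 >= len(ordered):
        return float(ordered[-1])
    # Move the fractional part of the way toward the next sorted value.
    return ordered[lower] + fraction * (ordered[lower + 1] - ordered[lower])

column = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]  # stands in for A1:A10

q1 = quartile_inclusive(column, 1)  # like =QUARTILE(A1:A10, 1)
q3 = quartile_inclusive(column, 3)  # like =QUARTILE(A1:A10, 3)
iqr = q3 - q1                       # like =C3-C2

print(q1, q3, iqr)  # 6.5 15.5 9.0
```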
Relationship Between an Outlier and IQR
According to ThoughtCo., “Outliers are individual values that fall outside of the overall pattern of a data set.” An outlier is a data point that is dramatically different from the other data points. We can identify outliers with the help of the interquartile range. Let’s take a closer look at this with the help of a box plot.
A Box Plot is a graphical representation of a data set’s statistical five-number summary.
A box plot accurately represents the distribution of information. It shows how much or how little the data is spread out and allows us to understand everything from its range to its skewness and so on.
As observed in the image above, Minimum is the lowest value in the dataset whereas Maximum is the highest value in the dataset. The difference between maximum and minimum reveals information about the dataset’s range.
The second quartile is considered the Median. 25% of the data is between the minimum and Q1, while 75% of the data is between the minimum and Q3.
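The five values a box plot is drawn from can be computed directly. The sketch below assumes Python 3.8 or later and the "inclusive" quartile method of `statistics.quantiles`. The function name `five_number_summary` and the data are illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

data = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]  # illustrative data only
summary = five_number_summary(data)

print(summary)  # (2, 6.5, 11.0, 15.5, 20)
```

The box in a box plot spans Q1 to Q3, so its length is the IQR, and the line inside the box marks the median.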
By now we already understand how to calculate the IQR. However, if we want to use the IQR method to identify outliers, we need to establish a specific range. Any data point that does not fall within this range is considered an outlier.
The formula for identifying an outlier is as follows:
- Lower Bound: (Q1 – 1.5 * IQR)
- Upper Bound: (Q3 + 1.5 * IQR)
An outlier is any data that is less than the Lower Bound or greater than the Upper Bound.
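The bounds above translate into a few lines of Python. This is a minimal sketch: the function name `iqr_outliers`, the "inclusive" quartile method, and the sample values (including the made-up extreme value 60) are all assumptions for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower bound, upper bound, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

sample = [2, 4, 6, 8, 10, 12, 14, 16, 18, 60]  # 60 is a made-up extreme value
lower, upper, outliers = iqr_outliers(sample)

print(lower, upper, outliers)  # -7.0 29.0 [60]
```

Because the quartiles barely move when one value becomes extreme, the bounds stay close to where they would be without it, which is why the extreme value is flagged.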
If you’re wondering why we use 1.5 rather than some other number to identify outliers, so was I, which led me to look into it.
If we used a larger multiplier than 1.5, the range would widen and some genuine outliers would be treated as ordinary data points. If we used a smaller multiplier, the range would narrow and ordinary data points would be misclassified as outliers. For normally distributed data, a multiplier of 1.5 places the bounds about 2.7 standard deviations from the mean, so roughly 99.3% of the data falls inside them. It is a conventional balance between the two kinds of error rather than a mathematically unique choice.
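One way to see what the multiplier does is to ask how much of a normal distribution falls inside the bounds for different multipliers. The sketch below uses `statistics.NormalDist` from the Python standard library (3.8 or later). The function name `fraction_inside` and the choice of multipliers to compare are assumptions for illustration.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def fraction_inside(k):
    """Share of a normal distribution lying within Q1 - k*IQR and Q3 + k*IQR."""
    q1 = STD_NORMAL.inv_cdf(0.25)
    q3 = STD_NORMAL.inv_cdf(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return STD_NORMAL.cdf(upper) - STD_NORMAL.cdf(lower)

for k in (1.0, 1.5, 2.0):
    print(f"k = {k}: {fraction_inside(k):.2%} of normal data inside the bounds")
```

With k = 1.5, about 99.3% of normally distributed data lies inside the bounds. A smaller multiplier flags noticeably more ordinary points, and a larger one flags almost nothing.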