Let us first understand what an outlier is and how to spot one. In simple terms, an outlier is the odd one out: someone who sticks out from a large crowd. In statistics, an outlier is any data point that deviates considerably from the rest of the data. Outliers matter because they can change the result of a data analysis, so we must learn how to handle them.
Outliers are extreme values: data points that are much higher or much lower than the rest of the dataset. They may arise from variability in measurement or from experimental error, and they often indicate that the distribution is heavy-tailed or highly skewed.
If a value is extremely high, far outside the expected range, it stretches the right tail of the distribution, making the curve right (positively) skewed. If a value is extremely low, it stretches the left tail, making the curve left (negatively) skewed.
Outliers are classified into three types:
- Global Outliers: Also called point outliers, these are individual data points whose values lie far outside the range of the entire data set.
- Contextual Outliers: A data point is a contextual outlier if its value deviates drastically from the rest of the data points in the same context, even though it might look normal in a different context.
- Collective Outliers: They are a group of unusual observations that look similar to one another because of their comparable abnormal values. In a time series, they might appear as peaks and valleys that look typical but occur outside the period in which that seasonal pattern is normal.
How to Handle Outliers in Data
To demonstrate, I created a data set with only one column (Age). I have included two extreme values: one that is high (100) and one that is low (-50).
Sample Dataset: 10,12,34,25,20,22,30,33,40,100,-50
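To see how much the two extreme values matter, the sample can be summarised with and without them. This is a minimal sketch; the vector names `age` and `age_trimmed` are illustrative only and do not come from the original workbook. In this particular sample the high and the low extreme roughly cancel out in the mean, but they inflate the standard deviation considerably.

```r
# Sample dataset from the text, entered directly as a vector
age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

# The same data with the two extreme values (100 and -50) left out
age_trimmed <- age[!(age %in% c(100, -50))]

# Compare centre and spread with and without the extremes
stats <- data.frame(
  data   = c("all values", "without extremes"),
  mean   = c(mean(age), mean(age_trimmed)),
  median = c(median(age), median(age_trimmed)),
  sd     = c(sd(age), sd(age_trimmed))
)
print(stats)
```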
A box plot is a graphical depiction of the distribution of the data. It makes use of the median as well as the lower and upper quartiles. A box plot can readily spot an unusual point in the data set, since any point plotted beyond the whiskers is flagged as a potential outlier. Because it examines one variable at a time, it is sometimes referred to as the “univariate method.”
A box plot visualizes descriptive statistics (Median, Q1, Q3, IQR, Minimum, Maximum). I will be using R software.
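# First install the "readxl" package to read Excel files into R (needed only once)
install.packages("readxl")
# Load the readxl package
library(readxl)
# Read the Excel file and draw a box plot of the Age column
my_data <- read_excel("Outlier.xlsx")
boxplot(my_data$Age, ylab = "Age")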
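The points a box plot draws beyond its whiskers can also be listed without plotting. The sketch below uses base R's `boxplot.stats()` on the sample values typed in as a vector (the name `age` is illustrative), so it does not depend on the Excel file.

```r
# Sample dataset from the text, entered directly as a vector
age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

# boxplot.stats() returns the numbers a box plot is drawn from
bp <- boxplot.stats(age)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$stats)

# Values that lie beyond the whiskers
print(bp$out)
```

For this sample the whiskers end at 10 and 40, and the two values reported beyond them are 100 and -50.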
In a histogram, outliers show up as a few observations that sit far away from the main group, where the majority of the data is concentrated.
A histogram can therefore also be used to detect outliers. R code to generate the histogram:
Age <- my_data$Age
hist(Age, xlab = "Age", col = "darkmagenta", xlim = c(-60, 100), ylim = c(0, 10),
     main = "Outlier Detection using Histogram")
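The isolated bars can also be read off numerically. The following sketch bins the sample with `hist(..., plot = FALSE)`; the bin width of 20 is an arbitrary choice made for illustration, and the vector `age` stands in for the `Age` column.

```r
# Sample dataset from the text, entered directly as a vector
age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

# Bin edges every 20 units from -60 to 100 (assumed for this example)
h <- hist(age, breaks = seq(-60, 100, by = 20), plot = FALSE)

# One row per bin: lower edge, upper edge, number of observations
bins <- data.frame(
  from  = head(h$breaks, -1),
  to    = tail(h$breaks, -1),
  count = h$counts
)
print(bins)
```

Bins that hold a single observation and are separated from the main group by empty bins are the candidates for outliers; here those are the lowest and the highest bin.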
Interquartile Range (IQR)
The interquartile range rule is useful for spotting outliers. The interquartile range (IQR), also called the middle 50% or H-spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, i.e., the third quartile (Q3) and the first quartile (Q1).
We identify the outliers as values less than Q1 - (1.5 * IQR) or greater than Q3 + (1.5 * IQR).
# Install the "readxl" package (needed only once)
install.packages("readxl")
# Load the readxl package
library(readxl)
# Read the Excel file
my_data <- read_excel("Outlier.xlsx")
# summary() computes summary statistics of data and model objects
summary(my_data)
#      Age
#  Min.   :-50.00
#  1st Qu.: 16.00
#  Median : 25.00
#  Mean   : 25.09
#  3rd Qu.: 33.50
#  Max.   :100.00
# Assign the quartiles reported by summary()
Q1 <- 16.00
Q3 <- 33.50
# Calculate the IQR
IQR <- Q3 - Q1
IQR     # 17.5
# Lower and upper limits (the upper limit is measured from Q3, not Q1)
Lower <- Q1 - (1.5 * IQR)
Upper <- Q3 + (1.5 * IQR)
Lower   # -10.25
Upper   # 59.75
# Any number smaller than Lower is an outlier
my_data[my_data$Age < Lower, ]   # -50
# Any number larger than Upper is an outlier
my_data[my_data$Age > Upper, ]   # 100
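The same rule can be wrapped in a small helper that computes the quartiles with `quantile()` instead of typing them in by hand. This is only a sketch: `iqr_outliers` and its argument `k` are hypothetical names introduced here, and the sample is entered as a vector rather than read from the Excel file.

```r
# Flag values outside Q1 - k * IQR and Q3 + k * IQR
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  fence <- k * (q[2] - q[1])
  lower <- q[1] - fence
  upper <- q[2] + fence
  list(lower = lower, upper = upper, outliers = x[x < lower | x > upper])
}

# Sample dataset from the text
age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)
res <- iqr_outliers(age)
print(res)
```

With the default `k = 1.5` the limits for this sample are -10.25 and 59.75, so 100 and -50 are the only values flagged.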
Standard deviation (SD) is a measure of how much the values in a data set vary or deviate from the mean.
We identify the outliers as values less than (Mean - 3*SD) or greater than (Mean + 3*SD).
R code to calculate the standard deviation:
# Load the readxl library
library(readxl)
# Read the Excel file
my_data <- read_excel("Outlier.xlsx")
# summary() computes summary statistics of data and model objects
summary(my_data)
#      Age
#  Min.   :-50.00
#  1st Qu.: 16.00
#  Median : 25.00
#  Mean   : 25.09
#  3rd Qu.: 33.50
#  Max.   :100.00
# Calculate the mean and the standard deviation
mean_age <- mean(my_data$Age)   # 25.09091
sd_age <- sd(my_data$Age)       # 34.74609
# Values greater than Mean + 3*SD (about 129.33) are considered outliers
my_data[my_data$Age > mean_age + (3 * sd_age), ]   # no rows
# Values less than Mean - 3*SD (about -79.15) are considered outliers
my_data[my_data$Age < mean_age - (3 * sd_age), ]   # no rows
# In this small sample the two extreme values inflate the SD,
# so the 3*SD rule flags neither 100 nor -50.
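The standard deviation rule can be written as a helper with an adjustable cutoff. The sketch below is illustrative: `sd_outliers` and its argument `k` are hypothetical names, and the cutoff of 2 is an assumption used only to show how the choice of `k` changes the result on a small sample, where extreme values inflate the standard deviation they are judged against.

```r
# Flag values more than k standard deviations away from the mean
sd_outliers <- function(x, k = 3) {
  z <- (x - mean(x)) / sd(x)   # z-score of every value
  x[abs(z) > k]
}

# Sample dataset from the text
age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

flagged_3sd <- sd_outliers(age, k = 3)
flagged_2sd <- sd_outliers(age, k = 2)
print(flagged_3sd)
print(flagged_2sd)
```

On this sample the largest absolute z-score is roughly 2.16, so a cutoff of 3 flags nothing while a cutoff of 2 flags both 100 and -50.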
A scatter plot helps in determining the degree of correlation between two numerical variables, such as a simple linear relationship between X and Y. On a scatter plot, an outlier is any observation that lies far from the overall pattern formed by the other points.
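As a rough illustration, the sketch below builds a small two-variable data set and adds one point that breaks the linear pattern. All values and names (`x`, `y`, `x_out`, `y_out`) are made up for this example and are not part of the Age data set.

```r
# Hypothetical data with a nearly perfect linear relationship
x <- 1:10
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 20.2)

# The same data plus one point far from the pattern (x = 5, y = 40)
x_out <- c(x, 5)
y_out <- c(y, 40)

# Draw the scatter plot only in an interactive session
if (interactive()) {
  plot(x_out, y_out, xlab = "X", ylab = "Y", main = "Scatter plot with one outlier")
  points(5, 40, col = "red", pch = 19)   # highlight the added point
}

# Correlation without and with the added point
cor_clean <- cor(x, y)
cor_with <- cor(x_out, y_out)
print(c(without_outlier = cor_clean, with_outlier = cor_with))
```

A single point far from the trend is easy to see in the plot and noticeably lowers the correlation coefficient.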
ML models & Outliers
In machine learning, outliers can distort the picture of the data as a whole, resulting in less accurate models and misleading results during training. However, not all machine learning models are sensitive to outliers. Models that are sensitive to outliers include:
- Linear Regression
- Logistic Regression
- Principal Component Analysis
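The sensitivity of linear regression can be seen on a toy example. This is a sketch on made-up data (the variables `x`, `y` and `y_out` are hypothetical): one response value is replaced by an extreme one and the model is refitted with `lm()`.

```r
# Hypothetical data lying exactly on the line y = 2x + 1
x <- 1:10
y <- 2 * x + 1

# Copy of the response with the last value replaced by an extreme one
y_out <- y
y_out[10] <- 60

# Fit a simple linear regression to each version and extract the slope
slope_clean <- coef(lm(y ~ x))[["x"]]
slope_out <- coef(lm(y_out ~ x))[["x"]]
print(c(clean = slope_clean, with_outlier = slope_out))
```

Changing a single observation roughly doubles the estimated slope here, because least squares penalises large errors quadratically.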
How to handle outliers?
- Using the Minkowski error as the loss function, in place of the squared error, can reduce the impact of outliers on the model.
- Removing those records from the data set completely reduces the skewness they cause during analysis.
- Assigning new values to such records, for example capping them at the nearest acceptable limit, can give better results.
- Transforming the values can reduce the influence of outliers. Scaling, log transformation, cube root transformation and similar techniques pull extreme values closer to the rest of the data.
- Imputing outliers, or treating them separately from the rest of the data.
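Three of the options above can be sketched on the Age sample: removing, capping and imputing. The 1.5 * IQR limits are used to decide which values count as outliers, and imputing with the median of the remaining values is an assumption made for this example; other limits or replacement values are equally possible.

```r
# Sample dataset from the text
age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

# Limits from the 1.5 * IQR rule
q <- quantile(age, probs = c(0.25, 0.75), names = FALSE)
lower <- q[1] - 1.5 * (q[2] - q[1])
upper <- q[2] + 1.5 * (q[2] - q[1])
is_out <- age < lower | age > upper

# Option 1: remove the outlying records
age_removed <- age[!is_out]

# Option 2: assign new values by capping at the limits
age_capped <- pmin(pmax(age, lower), upper)

# Option 3: impute outliers with the median of the remaining values
age_imputed <- ifelse(is_out, median(age[!is_out]), age)

print(age_removed)
print(age_capped)
print(age_imputed)
```

A log transformation is not shown because this sample contains a negative value, for which the logarithm is undefined.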
Not all outliers are troublesome. For example, in credit card fraud detection the unusual transactions are exactly what we want to notice, and in stock forecasting any unusual spike has to be taken into account. Hence it is very important to understand the dataset and the problem statement before deciding how to deal with outliers.