Let us first understand what an outlier is and how to spot one. In simple terms, an outlier is the odd one out, someone who stands out from a large crowd. In statistics, an outlier is any data point that deviates considerably from the rest of the data. Outliers matter because they can change the results of a data analysis, so we must learn how to handle them.
Outliers are extreme values: data points in a dataset that are extremely high or extremely low compared with the rest. They can arise from genuine variability in the measurement or from experimental error, and they may indicate that the distribution is heavy-tailed or highly skewed.
If a value is extremely high, far outside the expected range, it stretches the right tail of the distribution, making the curve right (positively) skewed. If a value is extremely low, it stretches the left tail, making the curve left (negatively) skewed.
To demonstrate, I created a dataset with only one column (Age). I have included two extreme values: one high (100) and one low (-50).
Sample Dataset: 10,12,34,25,20,22,30,33,40,100,-50
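For readers who do not have the Excel file at hand, the same sample can be typed directly into R. The sketch below is only an illustration; the names `age_core` and `age_all` are made up here, while `my_data` mirrors the name used in the code that follows. It also gives a rough feel for how a single extreme value pulls the mean more than the median.

```r
# The nine ordinary values from the sample dataset
age_core <- c(10, 12, 34, 25, 20, 22, 30, 33, 40)

# The full sample, including the two extreme values 100 and -50
age_all <- c(age_core, 100, -50)

# A one-column data frame, as a stand-in for the contents of Outlier.xlsx
my_data <- data.frame(Age = age_all)

# Without the extreme values, mean and median are close together
mean(age_core)            # about 25.11
median(age_core)          # 25

# Adding only the high value moves the mean much more than the median
mean(c(age_core, 100))    # 32.6
median(c(age_core, 100))  # 27.5
```

The mean jumps by about 7.5 while the median moves by 2.5, which is the right skew described above.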
A box plot is a graphical depiction of the distribution of a variable. It is built from the median and the lower and upper quartiles. A box plot readily reveals unusual points, since any point above or below the whiskers is flagged as a potential outlier. Because it looks at one variable at a time, it is sometimes referred to as a "univariate method."
A box plot visualizes descriptive statistics (median, Q1, Q3, IQR, minimum, maximum). I will be using R.
#First install the "readxl" package to read the Excel file into R
install.packages("readxl")
#Load the readxl package
library(readxl)
#Read the Excel file
my_data <- read_excel("Outlier.xlsx")
#Draw the box plot
boxplot(my_data)
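To read the flagged points off as numbers rather than from the picture, base R's `boxplot.stats()` returns the values behind the plot. This is a minimal sketch that types the sample in directly instead of reading the Excel file; the variable name `stats` is chosen here for illustration.

```r
# Sample values typed in directly, so the Excel file is not needed
Age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

# boxplot.stats() computes the numbers behind a box plot without drawing it
stats <- boxplot.stats(Age)

# Lower whisker, lower hinge, median, upper hinge, upper whisker
stats$stats

# Points that fall beyond the whiskers: -50 and 100
sort(stats$out)
```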
In a histogram, most of the observations cluster on one side, while the few observations that appear far from the main group are the outliers.
A histogram can therefore also be used to detect outliers. R code to generate the histogram:
Age<- my_data$Age
hist(Age,xlab = "Age",col="darkmagenta", xlim = c(-60,100), ylim = c(0,10),main="Outlier Detection using Histogram")
The interquartile range (IQR) rule is a common way to spot outliers. The IQR, also called the middle 50% or H-spread, is a measure of statistical dispersion equal to the difference between the 75th and 25th percentiles, i.e., the third quartile (Q3) and the first quartile (Q1).
IQR = Q3 - Q1
We identify outliers as values less than Q1 - (1.5 * IQR) or greater than Q3 + (1.5 * IQR).
#Install the "readxl" package
install.packages("readxl")
#load readxl package
library(readxl)
#Read the excel file
my_data <- read_excel("Outlier.xlsx")
#summary() computes summary statistics of data and model objects.
summary(my_data)
Age
Min. :-50.00
1st Qu.: 16.00
Median : 25.00
Mean : 25.09
3rd Qu.: 33.50
Max. :100.00
#Assigning Values
Q1<-16.00
Q3<-33.50
#Calculating IQR
IQR<-Q3-Q1
IQR
[1] 17.5
#Lower limit: Q1 - 1.5 * IQR
Lower <- Q1-(1.5*IQR)
#Upper limit: Q3 + 1.5 * IQR
Upper <- Q3+(1.5*IQR)
Lower
[1] -10.25
Upper
[1] 59.75
#Any number smaller than this is an outlier
my_data[my_data$Age<Lower,]
-50
#Any number larger than this is considered an outlier
my_data[my_data$Age>Upper,]
100
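The same rule can be wrapped in a small function so the quartiles are computed with `quantile()` instead of being copied by hand from `summary()`. This is a sketch under assumptions: `iqr_outliers` and its argument `k` are hypothetical names, and the quartiles use R's default quantile method. The upper limit follows the rule stated earlier, Q3 + (1.5 * IQR).

```r
# Hypothetical helper: returns the limits and the values of x outside them
iqr_outliers <- function(x, k = 1.5) {
  # First and third quartiles (R's default quantile method)
  q <- quantile(x, probs = c(0.25, 0.75), names = FALSE)
  spread <- q[2] - q[1]          # interquartile range
  lower <- q[1] - k * spread     # Q1 - k * IQR
  upper <- q[2] + k * spread     # Q3 + k * IQR
  list(lower = lower, upper = upper, outliers = x[x < lower | x > upper])
}

# Sample values typed in directly, so the Excel file is not needed
Age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

result <- iqr_outliers(Age)
result$lower     # -10.25
result$upper     # 59.75
result$outliers  # 100 and -50
```

Keeping `k` as a parameter makes it easy to try a stricter or looser limit than 1.5 without rewriting the calculation.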
The standard deviation (SD) is a measure of how much the values in a dataset vary or deviate from the mean.
We identify outliers as values less than (Mean - 3*SD) or greater than (Mean + 3*SD).
R code to calculate the standard deviation and apply the rule:
#load the readxl library
library(readxl)
#Read the excel file
my_data <- read_excel("Outlier.xlsx")
#summary() computes summary statistics of data and model objects.
summary(my_data)
Age
Min. :-50.00
1st Qu.: 16.00
Median : 25.00
Mean : 25.09
3rd Qu.: 33.50
Max. :100.00
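#Calculate the mean and the standard deviation
Mean <- mean(my_data$Age)
SD <- sd(my_data$Age)
SD
[1] 34.74609
#Values greater than Mean + 3*SD are considered outliers
my_data[my_data$Age > Mean + (3*SD),]
#Returns no rows: the upper limit is about 129.33, so 100 is not flagged
#Values less than Mean - 3*SD are considered outliers
my_data[my_data$Age < Mean - (3*SD),]
#Returns no rows: the lower limit is about -79.15, so -50 is not flagged
#In a sample this small, the two extreme values inflate the SD itself,
#so the 3*SD limits end up wider than the values they are meant to catch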
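It is worth checking the 3*SD limits numerically for this sample. The sketch below uses a hypothetical helper, `sd_limits`, and types the values in directly. With only eleven observations, the extreme values 100 and -50 enlarge the standard deviation so much that neither falls outside Mean ± 3*SD.

```r
# Hypothetical helper: lower and upper limits at k standard deviations
sd_limits <- function(x, k = 3) {
  c(lower = mean(x) - k * sd(x), upper = mean(x) + k * sd(x))
}

# Sample values typed in directly, so the Excel file is not needed
Age <- c(10, 12, 34, 25, 20, 22, 30, 33, 40, 100, -50)

limits <- sd_limits(Age)
limits  # roughly -79.15 (lower) and 129.33 (upper)

# Values outside the limits; empty for this sample
flagged <- Age[Age < limits[["lower"]] | Age > limits[["upper"]]]
flagged
```

The IQR rule, by contrast, flags both values, because quartiles are barely affected by the extremes.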
A scatter plot helps in determining the degree of correlation between two numerical variables, such as a simple linear relationship between X and Y. In a scatter plot, an outlier is any point that lies far from the overall pattern formed by the other points.
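As a rough illustration, the sketch below plots made-up data; the sample dataset in this article has only one column, so the `x` and `y` values here are purely hypothetical. Nine points follow the line y = 2x and one point does not.

```r
# Hypothetical data: y equals 2 * x, except for the point at x = 8
x <- 1:10
y <- c(2, 4, 6, 8, 10, 12, 14, 40, 18, 20)

# Draw the scatter plot; one point sits far above the others
plot(x, y, xlab = "X", ylab = "Y",
     main = "Outlier Detection using Scatter Plot")

# Index of the point farthest from the line y = 2 * x
which.max(abs(y - 2 * x))  # 8
```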
In machine learning, outliers often distort the interpretation of the data as a whole, resulting in less accurate models and incorrect results during training. However, not all machine learning models are sensitive to outliers. The ML models that are sensitive to outliers are listed below.
Not all outliers are troublesome. For example, in credit card fraud detection it is important to notice any unusual transaction, and in stock forecasting any unusual spike has to be taken into account. Hence, it is very important to understand the dataset and the problem statement before deciding how to deal with the outliers.