8 min read

#### What is Tableau? | Beginner's Guide to Key Features

Tableau is a powerful and intuitive data visualization software that has gained significant popularity in recent years. It allows users to quickly...

Welcome to Part II of learning about outliers! In my previous blog, I discussed what an outlier is, where they come from, and how to differentiate between outliers, high leverage points, and influential points. In this blog, I’ll walk through univariate approaches (data containing only one variable) for detecting and diagnosing outliers.

So, without any further ado, let’s get started!

The simplest way to detect outliers is by drawing box plots. Box plots, also known as box and whisker plots, are an easy way to observe the distribution of the data you’re analyzing. The box demonstrates the central 50% of the data, with a line in the middle that shows the median value. The lines extending from the box capture the range of the remaining data. Any data point that falls outside the lines indicates an outlier. In the plot below, we can see that the red dot sitting outside of the boxplot is an outlier:

*R code for drawing boxplot:*

boxplot( x, outcall = “red” )

See the section on *Interquartile Range* below for more information on how we arrive at the minimum and maximum values depicted by lines used to determine our outlier thresholds.

Calculating the Z-score provides a way to understand how far the data points are from the mean. For example, knowing someone’s test score was 82% might be good information, however, if you want to compare it to the average person’s score then you would want to look at the Z-score. The Z-score will be able to show you what the test score is in comparison to the average population’s mean score, all while accounting for the natural ‘spread’ of the data.

Z-score formula = (xi - mean) / standard deviation

- Calculate z-score using the formula for each data point
- Define a threshold and compare with Z-score
- Mak each data point that is outside the absolute value of Z-score as an outlier

*R code for calculating Z-score:*

thresh<- 3

mean<- mean(x)

std<- sd(x)

library(tidyverse)

data<- data %>%

mutate(zscore = (x - mean(x))/sd(x))

data<- data %>%

mutate(data,outlier= ifelse(zscore>thresh, "Outlier", "Not Outlier"))

Here I use the **tidyverse** package in R to help calculate the z-score for this example. Running this R-code will add two columns in your dataset, one that shows the z-score for each xi and one that indicates whether it is an outlier or not. From the output we can see that the data has one outlier:

The Interquartile range (IQR) consists of the central 50% of the data, and contains the majority of your data points. In a boxplot, the IQR is the box portion between the first and the third quartile. IQR can be used to calculate the lower and upper bounds of the data, which helps identify outliers.

- Sort the data in ascending order
- Calculate Q1 (25th percentile) and Q3 (75th percentile)
- Calculate IQR = Q3 - Q1
- Compute lower bound = (Q1 - 1.5 IQR)
- Compute upper bound = (Q3 + 1.5 * IQR)
- Mark each data point that falls outside the lower and upper bounds as outliers

*R Code for an IQR function that returns exact values of your outliers:*

IQR_function <- function(x) {

Q1<-quantile(x,0.25)

Q3<-quantile(x,0.75)

IQR<-(Q3-Q1)

left<- (Q1-(1.5*IQR))

right<- (Q3+(1.5*IQR))

c(x[x <left],x[x>right])

}

IQR_function(x)

This function calculates the lower (Q1) and upper (Q2) bounds of the interquartile range of your data, then finds the minimum and maximum limits. It then outputs the exact value of the outliers in your data.

Statistical Tests for Detecting Outliers for Univariate Data

Dixon’s Q Test can be used to identify a single outlier based on a hypothesis. This is typically used on a smaller sample ( greater than 3 and less than 30 ) and the data is assumed to be normally distributed.

*Null hypothesis (H0):* There are no outliers in the data

*Alternate hypothesis (Ha):* The maximum or minimum value is an outlier

*R code and Output:*

Since the p-value is less than 0.05, we have enough evidence to reject the null hypothesis and conclude that the maximum value twenty is an outlier.

Grubb’s Test is also a hypothesis-based approach similar to Dixon’s Test. However, this test can be used for a sample size greater than 6 and the data is normally distributed.

*Null hypothesis (H0):* There are no outliers in the data

*Alternate hypothesis (Ha):* The maximum or minimum value is an outlier

I’m using the Grubbs.test from the **outliers** package in R. From the output above we can see that the p-value is less than 0.05, which means we have enough evidence to reject the null hypothesis and state that the maximum value 20 is an outlier.

If you’re working with multiple independent variables There are more advanced statistical tests made specifically for **bivariate **and **multivariate **datasets, however, that is beyond the scope of this blog. If you’re curious, check out this blog that goes into more detail!

When you encounter outliers in your data, there are a few ways to proceed

If there is no reasonable scientific basis for an outlier to exist in the dataset, one of the easy ways to tackle the issue is to simply remove the data point and pretend like it was never there.

If you have just a few data points that are outliers, you could replace them with the next closest value to the median. This approach will allow the distribution to be the least affected.

Another way to tackle outliers is by replacing them with the mean of the remaining values without the outlier. With this approach you run the risk of distorting the distribution, so oftentimes analysts prefer to use option two.

Oftentimes, outliers are overlooked by analysts. As the use of statistical software and pre-packaged functions continues to grow, analysts need to be extra vigilant to avoid believing that the results returned by their analysis are accurate, when in reality there are outliers skewing the results. Always remember to check the data for outliers to avoid bias and incorrect results in your study.

It is a good practice to conduct your analysis using the full data both with AND without the outliers. If the results are similar, it is better to use the original data.

I hope this blog has given you some practical tips on how to detect and deal with outliers in your datasets. If you’re struggling to get accurate results from your own models and would like to work with our team of experts, feel contact us using the link below.

Happy Outlier hunting!

8 min read

Luke Komiskey: Mar 25, 2023

Tableau is a powerful and intuitive data visualization software that has gained significant popularity in recent years. It allows users to quickly...

3 min read

Luke Komiskey: Mar 25, 2023

Hello, data enthusiasts! The Tableau Conference 2023 is fast approaching, and our DataDrive team is here to help you prepare for this exciting event...

5 min read

Luke Komiskey: Mar 24, 2023

As organizations increasingly rely on Snowflake's cloud data platform to support their data analytics needs, managing costs becomes a critical aspect...