
7 Easy Techniques to Detect Anomalies in Pandas for Data Analysis

  • Writer: Claude Paugh
  • May 14
  • 4 min read

Updated: May 20

Data analysis is an exciting journey, but it comes with its challenges. One of the biggest hurdles is identifying anomalies—unexpected results that can distort our conclusions and predictions. Whether you are analyzing sales data or monitoring system performance, recognizing these anomalies is critical. As a passionate user of Python’s Pandas library, I’ve discovered several practical techniques to efficiently spot these anomalies. In this post, I’ll share seven reliable methods that you can easily implement to enhance your data analysis.

Pandas detecting anomalies

Understanding Anomalies


Anomalies, sometimes called outliers, are data points that significantly deviate from the overall distribution. They can stem from issues like measurement errors or true rare events. For example, if you are analyzing a dataset of customer purchase amounts, a single transaction of $10,000 in a dataset where most purchases are between $20 and $200 would be considered an anomaly. Recognizing these points is vital, as they can skew your insights and lead to incorrect decisions.
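To make the purchase example concrete, here is a minimal sketch using hypothetical figures (the amounts below are invented for illustration). Even a naive threshold separates the $10,000 transaction from the typical $20–$200 range:

```python
import pandas as pd

# Hypothetical purchase amounts: most fall between $20 and $200,
# with one $10,000 transaction mixed in
purchases = pd.Series([25, 40, 199, 88, 120, 35, 10_000, 60, 150, 75])

# A quick summary makes the deviation obvious
print(purchases.describe())

# Even a simple threshold (here, 5x the median) isolates the point
suspects = purchases[purchases > 5 * purchases.median()]
print(suspects)
```

The techniques below replace this ad-hoc threshold with principled rules.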


1. Visual Inspection with Box Plots


One of the simplest yet most effective ways to identify anomalies is through box plots. They clearly display the distribution of data and make it easy to spot outliers.


A box plot summarizes five statistics: the minimum, first quartile, median, third quartile, and maximum. Data points beyond the "whiskers" (conventionally 1.5 times the IQR past the quartiles) are potential outliers.


import pandas as pd
import matplotlib.pyplot as plt

# Sample data

data = {'Values': [1,3,4,6,78,100,210, 340, 3240, 234124, 462462, 562452, 45362435, 546452]}
df = pd.DataFrame(data)

# Box plot

plt.boxplot(df['Values'])
plt.title('Box Plot of Values')

plt.show()
Box Plot

Box plots surface outliers at a glance, making them an essential first tool in your toolkit. A quick look at the plot tells you whether any values merit further analysis.


2. Z-Score Analysis


The Z-score offers a straightforward method to quantify how far a data point is from the mean. A value with a Z-score above 3 or below -3 usually signals an anomaly.


Here’s how to compute Z-scores using Pandas:

from scipy import stats
import pandas as pd

data = {'Values': [1,3,4,6,78,100,210, 340, 3240, 234124, 462462, 562452, 45362435, 546452]}
df = pd.DataFrame(data)

df['Z-Score'] = stats.zscore(df['Values'])
anomalies = df[(df['Z-Score'] > 3) | (df['Z-Score'] < -3)]

print(anomalies)
Z-score output

3. Interquartile Range (IQR)


The Interquartile Range (IQR) is another robust method for spotting anomalies. The IQR is calculated as the difference between the first (25th percentile) and third quartiles (75th percentile), effectively capturing the middle 50% of the data.


Q1 = df['Values'].quantile(0.25)
Q3 = df['Values'].quantile(0.75)

IQR = Q3 - Q1
print(f"IQR: {IQR}")
IQR output

# Define bounds for identifying outliers

lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

anomalies = df[(df['Values'] < lower_bound) | (df['Values'] > upper_bound)]

print(anomalies)
Anomalies

Because it is built from quartiles rather than the mean, the IQR method is robust to extreme values: a single huge transaction shifts the mean dramatically but barely moves the quartiles. In a dataset of 500 sales records, for example, applying IQR can flag transactions that are inconsistent with typical customer buying patterns.


4. Time Series Analysis with Rolling Mean and Standard Deviation


For time series data, rolling statistics are particularly useful. They enable you to compute the mean and standard deviation within a moving window, making it easier to identify anomalies as trends change.

df['Rolling_Mean'] = df['Values'].rolling(window=3).mean()
df['Rolling_Std'] = df['Values'].rolling(window=3).std()

# Identify anomalies

anomalies = df[(df['Values'] > (df['Rolling_Mean'] + 2 * df['Rolling_Std'])) | (df['Values'] < (df['Rolling_Mean'] - 2 * df['Rolling_Std']))]

print(anomalies)
Anomalies

For example, if you are tracking sensor data over time, this method can alert you to fluctuations that could indicate malfunctioning equipment, allowing for timely maintenance before issues worsen. In our sample data, however, it detected none: each three-point window includes the extreme value itself, which inflates the very rolling mean and standard deviation that value is judged against.
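One common variant is to compare each point against the statistics of the *previous* window (via `.shift(1)`), so a spike cannot mask itself. A minimal sketch with hypothetical sensor readings:

```python
import pandas as pd

# Hypothetical sensor readings with one sudden spike
readings = pd.Series([10, 11, 10, 11, 11, 95, 10, 11, 12, 10], name='Temp')

# Compare each point to the mean/std of the window that *precedes* it,
# so the spike does not inflate the statistics it is judged against
roll_mean = readings.rolling(window=3).mean().shift(1)
roll_std = readings.rolling(window=3).std().shift(1)

spikes = readings[(readings > roll_mean + 2 * roll_std) |
                  (readings < roll_mean - 2 * roll_std)]
print(spikes)
```

Here the spike at index 5 is flagged, while the steady readings are not.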


5. Isolation Forest Algorithm


For those venturing into machine learning, the Isolation Forest algorithm is a powerful tool for identifying anomalies, especially in large datasets.


First, ensure you've installed the scikit-learn library. The Isolation Forest isolates anomalies by randomly selecting features and split values, which allows it to capture unusual patterns.


from sklearn.ensemble import IsolationForest

# contamination is the expected fraction of anomalies;
# random_state makes the result reproducible
model = IsolationForest(contamination=0.1, random_state=42)

df['Anomaly'] = model.fit_predict(df[['Values']])
anomalies = df[df['Anomaly'] == -1]
print(anomalies)
Anomalies

This method is particularly effective for datasets over tens of thousands of entries, where traditional methods may struggle to pinpoint anomalies.
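To see this at a larger scale, here is a sketch on synthetic data (the sizes, distributions, and contamination value below are illustrative assumptions, not from the article's dataset):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)

# Synthetic dataset: 10,000 typical values plus 20 extreme ones
normal = rng.normal(loc=100, scale=10, size=10_000)
extreme = rng.uniform(low=500, high=1_000, size=20)
big_df = pd.DataFrame({'Values': np.concatenate([normal, extreme])})

# contamination reflects the expected share of anomalies (~0.2%)
model = IsolationForest(contamination=0.002, random_state=0)
big_df['Anomaly'] = model.fit_predict(big_df[['Values']])

flagged = big_df[big_df['Anomaly'] == -1]
print(len(flagged))
```

With the contamination rate roughly matching the true share of injected outliers, the model should recover nearly all of the extreme values.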


6. Local Outlier Factor (LOF)


The Local Outlier Factor (LOF) is another advanced technique that evaluates the local density of a data point compared to its neighbors. Points that exhibit significantly lower density are flagged as anomalies.

from sklearn.neighbors import LocalOutlierFactor

lof = LocalOutlierFactor()
df['Anomaly'] = lof.fit_predict(df[['Values']])
anomalies = df[df['Anomaly'] == -1]

print(anomalies)
Anomalies

For instance, if you're analyzing traffic patterns, LOF can help identify unusual spikes or drops in volume, alerting you to potential incidents or data inaccuracies. In our case, the sample data produced no anomalies, likely because with only 14 rows LOF's default n_neighbors=20 is capped at n_samples - 1, so nearly every point shares the same neighborhood and their local densities look alike.
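LOF is at its best on clustered, multi-dimensional data with a sensible neighborhood size. A minimal sketch on synthetic 2D data (the cluster size, coordinates, and n_neighbors value are illustrative assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(7)

# A dense cluster of typical observations plus a few isolated points
cluster = rng.normal(loc=0, scale=1, size=(200, 2))
isolated = np.array([[8.0, 8.0], [-9.0, 7.5], [10.0, -8.0]])
points = pd.DataFrame(np.vstack([cluster, isolated]), columns=['x', 'y'])

# n_neighbors should be well below the sample size but large enough
# to capture the local structure of the cluster
lof = LocalOutlierFactor(n_neighbors=20)
points['Anomaly'] = lof.fit_predict(points[['x', 'y']])

outliers = points[points['Anomaly'] == -1]
print(outliers)
```

The three isolated points sit in regions of far lower density than their neighbors, so LOF flags them while leaving the cluster intact.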


7. Visualization Techniques


Visual tools can significantly enhance your understanding of data and help identify anomalies. Consider using scatter plots or pair plots to effectively inspect your data visually.

# Calculate upper and lower bounds
# For example, using mean ± one standard deviation
# (widen to 2 or 3 standard deviations for fewer false flags)
upper_bound = df['Values'].mean() + df['Values'].std()
lower_bound = df['Values'].mean() - df['Values'].std()

# Create the scatter plot
plt.scatter(df.index, df['Values'])
plt.title('Scatter Plot of Values')
plt.xlabel('Index')
plt.ylabel('Values')

# Add horizontal lines for bounds
plt.axhline(y=upper_bound, color='r', linestyle='-')
plt.axhline(y=lower_bound, color='r', linestyle='-')

plt.show()
Scatter Plot

Visualizations often reveal patterns that raw data and summary metrics obscure. They let you quickly spot structure and outliers in complex datasets.


Putting It All Together: Pandas Data Analysis


Recognizing anomalies is crucial for accurate data analysis, especially in sectors where precision is vital. While tools like box plots and Z-score analysis are great for initial inspections, advanced methods such as Isolation Forest and Local Outlier Factor excel in complex analyses.
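In practice, running two of the simpler checks side by side is a quick way to cross-validate a suspect point. Here is a small illustrative helper (the function name and structure are my own, not from a library) that applies the Z-score and IQR rules to the article's sample values; note that pandas' `std()` uses ddof=1, so the Z-scores differ slightly from scipy's, though both flag the same point here:

```python
import pandas as pd

def detect_anomalies(values: pd.Series) -> pd.DataFrame:
    """Flag points by both the Z-score and IQR rules (illustrative helper)."""
    mean, std = values.mean(), values.std()
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return pd.DataFrame({
        'Values': values,
        'z_flag': (values - mean).abs() / std > 3,
        'iqr_flag': (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr),
    })

report = detect_anomalies(pd.Series([1, 3, 4, 6, 78, 100, 210, 340,
                                     3240, 234124, 462462, 562452,
                                     45362435, 546452]))
print(report[report['z_flag'] | report['iqr_flag']])
```

A point flagged by both rules, like the largest value here, is a strong candidate for closer inspection.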


Experiment with these techniques, adapt them to your datasets, and sharpen your data analysis skills using Pandas. With these tools, you will be well-equipped to spot anomalies efficiently and make well-informed data-driven decisions.


Pandas hunting for more data anomalies
