My goals in this talk:
In the academic literature, there is some ambiguity in the definitions of outliers and anomalies. Some authors mean them to be the same and other authors differentiate the two terms. I will follow the latter practice.
An outlier is something sufficiently different from the norm that we notice it.
An anomaly is an outlier of interest to humans.
Let's dive further into general concepts and technical definitions.
The non-technical definition of an anomaly is essentially “I’ll know it when I see it.” This can get muddled at the edges, but works really well because humans are great at pattern matching and picking out things which look dissimilar.
One of the best bodies of work on how we process things visually comes from the Gestalt school of psychology. Its key insight is that our minds apply known and expected patterns to what our eyes see.
This leads to a few key Gestalt principles we can take advantage of.
We naturally fill in gaps and turn partial shapes into whole shapes.
We group things together based on their being inside or outside of a region.
We prefer to see the foreground rather than the background. Exceptions do exist, such as Rubin's vase:
Things which are nearer to each other are considered part of the same grouping, and "abnormal" separation creates new groups in our minds.
We group things together based on color, shape, and size.
We want to follow the smoothest path when viewing lines.
By contrast, this is a discomforting pattern because it breaks continuity.
We perceive ambiguous shapes in as simple a manner as possible. What is this?
Our minds put together that it's a mixture of multiple, slightly overlapping shapes.
We do this because we've never seen a character looking like this, and so don't think of the complex shape as "one" thing.
By contrast...
Because humans are pattern-matchers who try to apply fairly simple heuristics to visual inputs, we tend to see things that aren’t there. People can take advantage of this with optical illusions, but it also lets us make cogent observations.
Our eyes try to fit a line to the scatterplot and tell us direction and magnitude. And they also make us wonder about those two outliers dragging down our best-fit line.
A layman’s concept of anomalies is great, but it is ambiguous. Some things which look strange aren’t actually anomalous behavior, whereas some anomalies look reasonable at first glance.
A process control chart gives us an understanding of when a process is working within normal parameters ("in control") and when it escapes those confines and goes "out of control."
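Here is a minimal sketch of the idea behind a control chart. It assumes limits of mean +/- 3 standard deviations computed from a baseline period, which is a common convention rather than something this talk prescribes. The function names and the sample readings are made up for illustration.

```python
import statistics

def control_limits(baseline, sigmas=3.0):
    """Return (lower, upper) limits from an in-control baseline period.
    The 3-sigma default is an assumed convention, not a requirement."""
    center = statistics.mean(baseline)
    spread = statistics.stdev(baseline)  # sample standard deviation
    return center - sigmas * spread, center + sigmas * spread

def out_of_control(values, lower, upper):
    """Return (index, value) pairs that fall outside the limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]

# Hypothetical readings taken while the process was behaving normally.
baseline = [10.1, 9.8, 10.0, 10.2, 9.9, 10.0, 10.1, 9.9]
lower, upper = control_limits(baseline)

# Hypothetical new readings to check against the limits.
new_readings = [10.0, 10.3, 9.7, 12.5, 10.1]
flagged = out_of_control(new_readings, lower, upper)
print(f"limits = [{lower:.2f}, {upper:.2f}], flagged = {flagged}")
```

Computing the limits from a separate baseline keeps a single wild reading from widening the limits it is being judged against.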
When hunting for anomalies, we want data sets which have the following properties:
Betteridge’s Law of Headlines says no.
Time series data is used extremely frequently for tracking anomalies because anomalies tend to be temporal in nature. But you can use the same techniques when looking at cohorts within a given time frame.
This data was all collected in one time period, and yet we can envision a way to detect anomalies.
In this case, the assumption is that all members of a cohort should behave according to the same underlying function.
There are dozens of anomaly detection techniques available to us. Some commonalities among techniques are:
The standard deviation is a measure of the spread in our data: it is the square root of the variance.
For normal distributions:
Standard deviation is sensitive to outliers.
stdev({7.3, 8.2, 8.4, 9.1, 9.3, 9.6}) = 0.85.
Mean = 8.65
Now let's add one more datapoint:
stdev({7.3, 8.2, 8.4, 9.1, 9.3, 9.6, 1.9}) = 2.67.
Mean = 7.69
One outlier increases standard deviation considerably.
This also causes us to ignore otherwise-abnormal values like 5.1:
95% = mean +/- (2 * stdev)
Original 95% = 8.65 +/- 2*0.85 = [6.95, 10.35]
New 95% = 7.69 +/- 2*2.67 = [2.35, 13.03]
5.1 was caught by the original model but the new model thinks it's just fine.
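The arithmetic above can be reproduced with a short sketch using Python's standard library. It uses the sample standard deviation, which is what matches the 0.85 and 2.67 figures. The helper names are made up for illustration.

```python
import statistics

original = [7.3, 8.2, 8.4, 9.1, 9.3, 9.6]
with_outlier = original + [1.9]

def two_sigma_interval(data):
    """Return mean +/- 2 sample standard deviations as (low, high)."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return mean - 2 * sd, mean + 2 * sd

def is_flagged(value, data):
    """True when value falls outside the 2-sigma interval built from data."""
    low, high = two_sigma_interval(data)
    return value < low or value > high

for name, data in (("original", original), ("with outlier", with_outlier)):
    low, high = two_sigma_interval(data)
    print(f"{name}: mean={statistics.mean(data):.2f} "
          f"stdev={statistics.stdev(data):.2f} "
          f"interval=[{low:.2f}, {high:.2f}] "
          f"5.1 flagged? {is_flagged(5.1, data)}")
```

The printed interval bounds may differ from the slide figures in the last digit, because the slides round the mean and standard deviation before combining them.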
Median Absolute Deviation is a robust statistic: it can handle a limited number of outliers without breaking down.
$MAD = median(|X_i - \widetilde X|)$
Using the original dataset from before, let's calculate median and MAD.
To find a median, sort the values and eliminate the extremes from both ends until you get to the center 1-2 elements. If two elements remain, take their average.
X = {7.3, 8.2, 8.4, 9.1, 9.3, 9.6}. Median = 8.75
MAD = med({1.45, 0.55, 0.35, 0.35, 0.55, 0.85}) = 0.55.
Now let's add that outlier:
X2 = {7.3, 8.2, 8.4, 9.1, 9.3, 9.6, 1.9}. Median = 8.4
MAD = med({1.1, 0.2, 0, 0.7, 0.9, 1.2, 6.5}) = 0.9.
Old median = 8.75, old MAD = 0.55
New median = 8.4, new MAD = 0.9
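A good rule of thumb is to flag values more than 3 * MAD away from the median. Both the old and the new lower bounds would catch 5.1 as an outlier value.
Old lower bound: 8.75 - 3*0.55 = 7.1
New lower bound: 8.4 - 3*0.9 = 5.7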
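The same check can be written as a short sketch with Python's standard library. The function names are made up for illustration, and the multiplier of 3 is the rule of thumb from this talk rather than a fixed standard.

```python
import statistics

def median_absolute_deviation(data):
    """Median of the absolute distances from the median."""
    center = statistics.median(data)
    return statistics.median([abs(x - center) for x in data])

def is_mad_outlier(value, data, k=3.0):
    """True when value is more than k * MAD away from the median of data."""
    center = statistics.median(data)
    return abs(value - center) > k * median_absolute_deviation(data)

original = [7.3, 8.2, 8.4, 9.1, 9.3, 9.6]
with_outlier = original + [1.9]

for name, data in (("original", original), ("with outlier", with_outlier)):
    print(f"{name}: median={statistics.median(data):.2f} "
          f"MAD={median_absolute_deviation(data):.2f} "
          f"5.1 flagged? {is_mad_outlier(5.1, data)}")
```

Unlike the standard deviation version, the single extreme value 1.9 barely moves the threshold, so 5.1 is still flagged against both datasets.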
Suppose we have a trend with an anomalous jump. How do we separate the anomalous increase from the trend?
De-trend: fit the data with a line and track the difference from the line.
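As a rough sketch of de-trending, the code below fits an ordinary least-squares line against the point index and then looks at the residuals. Two assumptions go beyond the slide: the trend is treated as linear, and residuals are flagged with a 3 * MAD rule. The series and the function names are invented for illustration.

```python
import statistics

def fit_line(ys):
    """Ordinary least-squares fit of y against index 0..n-1.
    Returns (slope, intercept)."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in range(n))
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, y_mean - slope * x_mean

def residuals(ys):
    """Difference between each point and the fitted line."""
    slope, intercept = fit_line(ys)
    return [y - (slope * x + intercept) for x, y in enumerate(ys)]

def flag_residuals(ys, k=3.0):
    """Indexes whose residual is more than k * MAD from the median residual."""
    res = residuals(ys)
    center = statistics.median(res)
    mad = statistics.median([abs(r - center) for r in res])
    return [i for i, r in enumerate(res) if abs(r - center) > k * mad]

# Hypothetical upward trend (y = 2x + 1) with one anomalous jump at index 6.
series = [2.0 * i + 1.0 for i in range(10)]
series[6] += 15.0

print(flag_residuals(series))
```

The jump pulls the fitted line toward itself a little, which is one reason to judge residuals with a robust statistic instead of the standard deviation.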
Changepoint detection looks for abrupt shifts in time series data.
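To make the idea concrete, here is a deliberately simple sketch that looks for a single shift in the mean. It tries every split point and keeps the one that minimizes the total squared error around each segment's own mean. This is only one of many possible approaches, not the method any particular package uses, and the data is invented.

```python
def sum_squared_error(segment):
    """Squared distance of each value from the segment's own mean."""
    mean = sum(segment) / len(segment)
    return sum((v - mean) ** 2 for v in segment)

def best_single_changepoint(values):
    """Index where splitting into values[:i] and values[i:] gives the
    lowest combined squared error. Assumes exactly one shift in the mean."""
    best_index, best_cost = None, float("inf")
    for i in range(1, len(values)):
        cost = sum_squared_error(values[:i]) + sum_squared_error(values[i:])
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index

# Hypothetical series that shifts from about 5 to about 9 at index 5.
values = [5.0, 5.2, 4.9, 5.1, 5.0, 9.0, 9.2, 8.9, 9.1, 9.0]
print(best_single_changepoint(values))
```

This sketch always returns some split, even for data with no real shift, so a practical version would also need a test for whether the improvement is large enough to matter.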
Another common technique is to measure the difference between consecutive points and perform statistical analysis on those differences.
We can perform all of the same analyses on deltas that we do on raw values.
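A small sketch of working with deltas: compute the difference between consecutive readings, then apply the same median and MAD check to those differences. The readings, function names, and the 3 * MAD cutoff are illustrative assumptions.

```python
import statistics

def deltas(values):
    """Difference between each point and the one before it."""
    return [b - a for a, b in zip(values, values[1:])]

def flag_deltas(values, k=3.0):
    """Indexes into the delta list that sit more than k * MAD from the
    median delta. Delta i is the step from values[i] to values[i + 1]."""
    steps = deltas(values)
    center = statistics.median(steps)
    mad = statistics.median([abs(s - center) for s in steps])
    return [i for i, s in enumerate(steps) if abs(s - center) > k * mad]

# Hypothetical readings that rise by about 2 per step, with one large jump.
readings = [10.0, 12.1, 13.9, 16.0, 30.2, 32.0, 34.1, 36.0]

print(flag_deltas(readings))
```

Because the steady rise of about 2 per step becomes the "normal" delta, the trend no longer hides the jump between the fourth and fifth readings.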
Here are a few examples of pre-written packages for anomaly detection:
If you decide to build your own anomaly detection process, check out MathNet.
MathNet is a series of .NET libraries for numerical and statistical analysis.
This allows you to customize the statistical tests to run and generate results very quickly in C# or F# code.
Many of these sorts of tests are one-liners with MathNet.Numerics.
Another alternative is to use anomaly detection within the ML.NET package.
ML.NET is an actively developed library for machine learning within .NET and supports both F# and C#.
Prep steps in Visual Studio Code or at the command line:
The Azure Cognitive Services Anomaly Detector API allows you to perform anomaly detection from any language which supports hitting REST APIs.
Steps:
Although we have libraries like ML.NET which provide anomaly detection, you can also use the Anomaly Detection API in your C# or F# code.
Prep steps in Visual Studio Code or at the command line:
Over the course of this talk, we have looked at the concept of anomalies, some techniques for detecting them, and .NET packages to make it easy.
To learn more, go here:
https://csmore.info/on/anomalies
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact