In the academic literature, there is some ambiguity in the definitions of outliers and anomalies. Some authors treat the terms as synonyms, while others differentiate the two. I will follow the latter practice.
An outlier is something sufficiently different from the norm that we notice it.
An anomaly is an outlier of interest to humans.
Let's dive further into general concepts and technical definitions.
The non-technical definition of an anomaly is essentially “I’ll know it when I see it.” This can get muddled at the edges, but works really well because humans are great at pattern matching and picking out things which look dissimilar.
A layman's concept of anomalies is great, but it is ambiguous. Some things which might look strange actually aren't anomalous behavior, whereas some anomalies might look reasonable at first glance.
For an in-depth look at the topic of outliers, check out https://csmore.info/on/anomalies for a full-length introduction to outlier detection.
Univariate = One variable
We have a thing whose behavior we want to understand, and the only thing we have available is a stream of data points. This stream may or may not be part of a time series; if it is, analyzing it is known as univariate time series analysis.
A variety of techniques are available for univariate outlier detection. Most of these techniques incorporate an analysis of the distribution of the data: how frequently we see certain values, as well as how close a given value is to others.
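To make that concrete, here is a minimal sketch of two common distribution-based rules, a z-score rule and an interquartile range (IQR) rule. The sample data and the cutoffs (2.5 standard deviations, 1.5 * IQR) are illustrative assumptions, not values from this talk.

```python
# Two simple distribution-based checks over a single stream of values.
import numpy as np

values = np.array([4.2, 4.6, 5.1, 4.9, 5.0, 4.8, 5.3, 4.7, 9.8, 5.2])

# z-score rule: flag points far from the mean in standard-deviation terms
z_scores = (values - values.mean()) / values.std()
z_outliers = values[np.abs(z_scores) > 2.5]

# IQR rule: flag points more than 1.5 * IQR beyond the first/third quartile
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]

print("z-score outliers:", z_outliers)
print("IQR outliers:", iqr_outliers)
```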
Fundamentally, multivariate outlier detection is an extension of univariate outlier detection. We still want to find points which are different from the norm.
A key difference is that distributions tend to be much more complex as different variables correlate to one another.
Here, we have two variables which we expect to correlate fairly well.
If we expect correlation, we can regress one against the other and look for points well off of a given line.
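As a minimal sketch of that regression idea, the following example fits a line to synthetic data and flags points whose residual is far from the rest; the data and the 3-standard-deviation residual cutoff are assumptions of my own, not part of the talk.

```python
# Regression-based outlier detection for two correlated variables:
# fit a line, then flag points whose residual is unusually large.
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(0, 0.5, 100)
y[10] += 8.0  # inject a point well off the line

# Ordinary least squares fit of y against x
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)

# Flag points whose residual is more than 3 standard deviations out
cutoff = 3 * residuals.std()
outlier_idx = np.where(np.abs(residuals) > cutoff)[0]
print("Points well off the regression line:", outlier_idx)
```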
Univariate outlier detection almost always focuses around numeric data. With multivariate outlier detection, we still emphasize numeric variables but might also have non-numeric variables. The problem is that all of our techniques expect numeric inputs, not strings or other non-numeric types.
There are a few ways to deal with this problem, such as one-hot encoding categorical values or converting them to ordinal numbers.
For simplicity, we'll send in numeric data only.
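As a hypothetical illustration of one such approach, here is a minimal sketch that one-hot encodes a categorical column with pandas so that every feature the detector sees is numeric; the column names and values are made up for the example.

```python
# One-hot encode a non-numeric column before outlier detection.
import pandas as pd

df = pd.DataFrame({
    "temperature": [71.2, 68.9, 70.4, 104.5],
    "sensor_type": ["indoor", "indoor", "outdoor", "indoor"],
})

# pd.get_dummies turns each category into its own 0/1 column
numeric_df = pd.get_dummies(df, columns=["sensor_type"])
print(numeric_df)
```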
Typically, however, we don't expect such close correlation in variables and instead want to analyze clusters of data.
Intuitively, outliers are points which are sufficiently "far" from other points (or the majority of points)--if there were a lot of points clustered closely together, they'd be inliers!
In practice, we have techniques which help us formalize this intuition.
Distance is a great indicator of whether a point is an outlier.
But it isn't the only one. In this case, A is further from the middle than B, but A is clearly part of a cluster whereas B is outside its own cluster.
Typically, we calculate density as distance to the k nearest neighbors. As a quick note, these relationships are NOT symmetric! The two nearest neighbors of the black dot do not include the black dot in their own two nearest neighbors.
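Here is a minimal sketch of that asymmetry using scikit-learn's NearestNeighbors; the coordinates are an illustrative assumption.

```python
# Show that the k-nearest-neighbor relationship is not symmetric:
# an isolated point's neighbors need not count it among their own neighbors.
import numpy as np
from sklearn.neighbors import NearestNeighbors

points = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.3],   # a tight cluster
    [5.0, 5.0],                           # an isolated point
])

# k = 2 neighbors (the first returned neighbor is the point itself)
nn = NearestNeighbors(n_neighbors=3).fit(points)
_, idx = nn.kneighbors(points)

isolated = 3
neighbors_of_isolated = set(idx[isolated][1:])
print("Neighbors of the isolated point:", neighbors_of_isolated)
for n in neighbors_of_isolated:
    print(f"Point {n}'s neighbors:", set(idx[n][1:]),
          "-> includes the isolated point?", isolated in set(idx[n][1:]))
```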
Local Outlier Factor (LOF) is a density-based approach to finding outliers in a dataset.
We want to measure the local density of any given data point and see how it compares to its k nearest neighbors.
The neighborhood size (k) here is 5. We find the distance to the kth point and all points inside that zone have the same reachability distance.
For any point outside that neighborhood, we calculate the distance from x to that point (e.g., b).
The local reachability density is the reciprocal of the average reachability distance to a point's k nearest neighbors.
Finally, the Local Outlier Factor is the ratio of the average local reachability density of x's k nearest neighbors to the local reachability density of x. The higher the value, the more likely the point is an outlier.
There is no exact threshold which indicates that a point is an outlier. Instead, we often track the top few percent of scores (e.g., the top 5%) and report those as outliers.
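Here is a minimal sketch of LOF via the PyOD library using that top-few-percent convention; the synthetic data and the 5% contamination value are illustrative assumptions.

```python
# LOF via PyOD: contamination=0.05 labels the highest-scoring 5% as outliers.
import numpy as np
from pyod.models.lof import LOF

rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),    # dense cluster
    rng.uniform(-6, 6, size=(10, 2)),   # scattered points
])

clf = LOF(n_neighbors=20, contamination=0.05)
clf.fit(X)

scores = clf.decision_scores_   # higher score = more outlier-like
labels = clf.labels_            # 1 = outlier, 0 = inlier (top 5% by score)
print("Number flagged as outliers:", labels.sum())
```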
LOF sounds great but breaks down when densities change. The black dot is an obvious outlier, but a cutoff which catches it also marks all of the points in the circle as outliers, even though they aren't!
Connectivity-Based Outlier Factor (COF) is an attempt to deal with the relative density problems in LOF.
To do so, it adds isolativity in addition to density, where isolativity is the degree to which an object is connected to other objects.
First, build a chain of length k, linking the current chain to the next-nearest neighbor. Here, k = 6, covering a total of 7 nodes.
Then, perform a weighted calculation which emphasizes the lower-numbered links more than the higher-numbered links. This gives us the COF value. Points with higher COF values are more likely to be outliers.
The PyOD library has an implementation of COF we can use.
COF is definitely better than LOF and is a useful part of an ensemble. The problem is that COF does not provide specific guidance on what is an outlier.
The closest we have is an ordered list (letting us choose the top X% by score) and the ability to create our own cutoff.
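Here is a minimal sketch of COF via PyOD, imposing our own top-5% cutoff on the ordered scores; the synthetic data and the choice of 5% are illustrative assumptions.

```python
# COF via PyOD: score every point, then apply our own top-5% cutoff.
import numpy as np
from pyod.models.cof import COF

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 1, size=(150, 2)),
    rng.normal(8, 0.3, size=(50, 2)),     # a second, tighter cluster
    np.array([[4.0, 4.0], [-5.0, 6.0]]),  # a couple of stray points
])

clf = COF(n_neighbors=20)
clf.fit(X)

scores = clf.decision_scores_
cutoff = np.percentile(scores, 95)        # our own top-5% cutoff
outlier_idx = np.where(scores > cutoff)[0]
print("Points in the top 5% of COF scores:", outlier_idx)
```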
Local Correlation Integral (LOCI) is intended to solve this problem by creating an automated cut-off to determine whether a given point is an outlier.
LOCI does not ask for k nearest neighbors. Instead, it asks for a neighborhood size alpha, representing a maximum distance between neighbors for our calculation. Alpha ranges between 0 and 1, and the typical default is alpha = 0.5.
The sampling neighborhood r around a point p1 represents all the points to which we will compare p1 for normality. The counting neighborhood alpha * r around a point p1 represents how closely packed the points around p1 are.
In this case, p1's counting neighborhood size is 1, p2's size is 3, and p3's size is 5. Because p2 and p3 are in p1's sampling neighborhood, the comparison set is p1, p2, and p3, so the mean counting neighborhood size is (1 + 3 + 5) / 3 = 3.
The formula for the Multi-Granularity Deviation Factor (MDEF) is 1 - (my counting neighborhood size) / (average counting neighborhood size for my sampling neighborhood).
For p1, that is 1 - (1 / 3) = 2/3.
For p2, that is 1 - (3 / 3) = 0.
For p3, that is 1 - (5 / 3) = -2/3.
We also want to calculate the standard deviation of MDEF, which is (standard deviation of counting neighborhood sizes in my sampling neighborhood) / (average counting neighborhood size for my sampling neighborhood).
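As a quick check on the arithmetic, here is a minimal sketch that reproduces the worked example's MDEF values and the standard deviation of MDEF from the counting neighborhood sizes above.

```python
# MDEF arithmetic for the worked example: counting-neighborhood sizes of
# 1 (p1), 3 (p2), and 5 (p3), all within p1's sampling neighborhood.
import numpy as np

counts = np.array([1, 3, 5])            # counting neighborhood sizes
mean_count = counts.mean()              # 3.0

mdef_p1 = 1 - counts[0] / mean_count    #  2/3
mdef_p2 = 1 - counts[1] / mean_count    #  0
mdef_p3 = 1 - counts[2] / mean_count    # -2/3

# Standard deviation of the counting sizes divided by their mean
sigma_mdef = counts.std() / mean_count
print(mdef_p1, mdef_p2, mdef_p3, sigma_mdef)
```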
This gets us to our final parameter: k, representing the number of standard deviations from MDEF before we declare an outlier. By convention, this is 3. If the set of neighborhood sizes follows a normal distribution, fewer than 1% of data points will trigger when k = 3.
The PyOD library has an implementation of LOCI we can use. It is, however, very slow, so we will only use it when the data set is sufficiently small (e.g., fewer than 1,000 data points).
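Here is a minimal sketch of LOCI via PyOD on a deliberately small dataset, using the conventional alpha = 0.5 and k = 3; the synthetic data is an illustrative assumption.

```python
# LOCI via PyOD, kept to a small dataset because the algorithm is slow.
import numpy as np
from pyod.models.loci import LOCI

rng = np.random.default_rng(2)
X = np.vstack([
    rng.normal(0, 1, size=(200, 2)),   # well under 1,000 points
    np.array([[7.0, 7.0]]),            # an obvious stray point
])

clf = LOCI(alpha=0.5, k=3)
clf.fit(X)

# LOCI provides its own automated cutoff, so labels_ needs no manual threshold
print("Points flagged as outliers:", np.where(clf.labels_ == 1)[0])
```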
LOF, COF, and LOCI are all clustering-based approaches to outlier detection: we treat points sufficiently distant from other points as outliers.
Another technique is to use copulas.
A copula is a link. In language, a copula links the subject and the predicate, such as the linking verb "to be."
In statistics, a copula is "a probability model that represents a multivariate uniform distribution, which examines the association or dependence between many variables" (Kenton).
Imagine a multivariate dataset. Each variable in the dataset may follow its own probability distribution.
In addition, variables may be correlated. In other words, there is a joint probability distribution between some set of variables in our dataset.
How do we model the joint probability distribution of any combination of these variables?
We transform a joint probability distribution into several marginal distributions (aka marginals). Each marginal describes a single variable on its own, independent of the other variables.
A copula is the function which links those marginal distributions to the joint distribution, separating each variable's individual behavior from the dependence between variables.
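To build intuition, here is a minimal sketch of the empirical-copula idea: rank-transform each variable so its marginal is roughly uniform on [0, 1] while the dependence between variables is preserved. The synthetic data is an illustrative assumption.

```python
# Rank-transform (probability integral transform) each variable so its
# marginal is approximately Uniform(0, 1); the joint behavior of the
# transformed columns is the empirical copula.
import numpy as np
from scipy.stats import rankdata

rng = np.random.default_rng(3)
x = rng.exponential(2.0, size=500)          # skewed marginal
y = 0.7 * x + rng.normal(0, 1.0, size=500)  # correlated with x

u = rankdata(x) / (len(x) + 1)
v = rankdata(y) / (len(y) + 1)

# The marginals are now uniform, but the dependence between x and y remains
print("Correlation survives the transform:", np.corrcoef(u, v)[0, 1])
```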
The PyOD library also has an implementation of COPOD we can use.
One important consideration here is that the scale of COPOD's scores differs significantly from that of COF and LOCI. Furthermore, it changes based on the median of the dataset.
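Here is a minimal sketch of COPOD via PyOD which converts the raw scores to percentile ranks so they can be compared against scores from COF or LOCI; the synthetic data is an illustrative assumption.

```python
# COPOD via PyOD; convert raw scores to percentile ranks so different
# detectors' scores are comparable when building an ensemble.
import numpy as np
from pyod.models.copod import COPOD

rng = np.random.default_rng(4)
X = np.vstack([
    rng.normal(0, 1, size=(300, 3)),
    np.array([[6.0, -6.0, 6.0]]),   # an obvious stray point
])

clf = COPOD()
clf.fit(X)

scores = clf.decision_scores_
ranks = scores.argsort().argsort() / (len(scores) - 1)   # percentile ranks in [0, 1]
print("Highest-scoring point:", scores.argmax(),
      "percentile rank:", ranks[scores.argmax()])
```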
The last step in the process is for a human to understand whether any given outlier is actually an anomaly and act upon it. We have built an outlier detection engine, but it takes people to turn it into an anomaly detector.
As a bonus, if you allow people to mark outliers as anomalous or not, you can use that data to train a future model, possibly one which is considerably better than what a general-purpose outlier detection engine will give you.
To learn more, go here:
https://csmore.info/on/multiad
To get a copy of the book, go here:
Finding Ghosts In Your Data
And for help, contact me:
feasel@catallaxyservices.com | @feaselkl
Catallaxy Services consulting:
https://CSmore.info/on/contact