BrandVerity has been doing dev hackdays every 6 weeks for a long time now. For this particular hackday, I was interested in playing with unsupervised anomaly detection, a concept I first encountered in this really great Andrew Ng machine learning class on coursera.

At a high level, anomaly detection can be thought of as automatically identifying outliers in a dataset.

One example of a problem we have that seemed like a great candidate for anomaly was attempting to identify bad proxies. We use a large fleet of proxies to do the vast majority of our web crawling. Identifying when one proxy is misbehaving can sometimes be tricky! If a proxy is completely unresponsive to all requests, that's pretty easy to catch. But what if the proxy is just partially failing for some requests? That failure could be normal -- for instance if the target server being accessed via the proxy is down -- or it could mean the proxy itself is a problem.

Enter the probabilistic anomaly detection technique! The basic concept here is to build a probabilistic model for the various features in your dataset, and then multiply things out to end up with an overall probability of how likely it is a given data point is an anomaly vs ‘normal’. This technique has some neat advantages: it can both identify the situation of a single feature being way off from the norm, as well as the situation where every feature is slightly off. The latter situation can be a lot trickier to catch with more naive approaches. One simple way to model this probability is with a Gaussian distribution, where the mean and standard deviation are trivially calculated from the data.

An example feature for our proxy problem might be the percentage of requests that returned a 200 (success) HTTP status code. This graph shows a histogram of the raw data for this feature, overlaid with a gaussian distribution with a mean and standard deviation computed from the data.

As a human we can easily see the outlier on the left. This is also born out by the value of the distribution at that point (it will be a number incredibly close to zero, almost no chance of this value being normal).

With all of the features modeled, and all probabilities calculated and multiplied, we end up with a distinct probability score for each proxy. This graph shows the probability score for each proxy over time:

This graph has a log-y axis of probability. You can see the bottom green line, which is WAY down from everyone else, was indeed a problem proxy (and lined up with the outlier in the 200 HTTP status code feature in the previous graph). By tuning in a threshold here, you can alert on likely anomalies with various false positive/negative rates depending on your tolerances/requirements. Another nice advantage of this technique, which really helps speed up the initial investigation, is that you can use the per-feature probabilities to help explain why a given proxy is considered an anomaly, as opposed to just getting a black box answer out.

For some example Python code which you can use to compute these probabilities and try this technique out yourself, I've uploaded an IPython notebook here: https://gist.github.com/aschokking/ad4be38f1080678c637f318ef086896e

We've been using this technique in production for a while now, and it has proven very robust and useful. I'm looking forward to applying it to more and more complex problems in the future!