How to build an Autoencoder Anomaly Detection model that will generalize to new datasets

Published:
2:15pm Sep 22, 2021

Introduction

OK, you’ve built a model and it’s great! The fitting function does not return any errors, you get predictions, the loss function looks small-ish, all is good.

Right?

Not quite!

The hard (interesting) part is just starting. The key question is “will you still love me tomorrow?”, i.e., how well will your model work when a new set of data comes in? Does it generalize? This is where we move from plumbing to philosophy.

How can we possibly know whether our model will work on data we haven’t seen yet? Obviously, any model can fail if the new data is totally different from our training data. We have to make some assumptions in order to move forward.

This gets even more interesting in the case of an autoencoder model we want to use for anomaly detection. If we overfit, we may not see any outliers at all; and if we go too far in the other direction, we may create a lot of false positives - identifying anomalies where there are none. 

Let's start with some basics, however. What is an anomaly?

A Quick Introduction to Anomalies

Anomaly detection is a way of detecting abnormal behavior. One definition of anomalies is data points that do not conform to an expected pattern compared to the other items in the data set. Anomalies are from a different distribution than other items in the dataset. Anomalies in data translate to significant (and often critical) actionable information in a wide variety of application domains.

You can read our full Wiki on anomalies and how to detect them. TIBCO has also published an e-book beginner's guide.

Building an Autoencoder Anomaly Detection model in TIBCO Spotfire®

Through Spotfire's interactive and visually driven integration with Python and TERR® (enterprise R), we can support the full data science lifecycle of building an anomaly detection model in Spotfire, in a way that is intuitive and usable by many different personas.

To this end, TIBCO has published two resources for TIBCO Spotfire: an Autoencoder Template that performs time series anomaly detection and root cause analysis, and an Autoencoder Python Data Function. Both use TensorFlow (2.5.0) with the Keras API for the Python implementation of the autoencoder. These act as great starter templates for anyone interested in exploring the potential of these techniques and models.

Fit versus Generalization

This dichotomy is also known as the bias versus variance tradeoff. A model that is overfit is subject to a lot of variance when we train on different training data. This means that when presented with new data, it can fail badly and produce inaccurate results.  A less flexible model includes a bias in its assumptions, and can miss out on important features in the data. We strive for a balance between these two extremes, in particular, a balance suitable to our use case.

Validation Samples

How do we approach the problem of creating a model that works on unseen data? A simple and widely used method is to create a hold-out sample or validation sample. We fit the model to a training sample, which excludes data from our hold-out sample. Then we can evaluate the model performance on the hold-out sample. This can be done repeatedly with different settings for the hyperparameters in an attempt to optimize these settings.
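As a minimal sketch of the hold-out idea, the following uses only the Python standard library. The names `holdout_split`, `pick_best_setting`, and `fit_and_score` are hypothetical helpers for illustration, not part of Spotfire or Keras; `fit_and_score` stands in for whatever routine fits a model with one hyperparameter setting and returns its validation error.

```python
import random


def holdout_split(rows, validation_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (training, validation) lists."""
    if not 0.0 < validation_fraction < 1.0:
        raise ValueError("validation_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_validation = max(1, int(round(len(shuffled) * validation_fraction)))
    # The first n_validation shuffled rows are held out; the rest are for fitting.
    return shuffled[n_validation:], shuffled[:n_validation]


def pick_best_setting(candidates, fit_and_score, train, validation):
    """Return the candidate with the lowest validation score, plus all scores."""
    scores = {c: fit_and_score(c, train, validation) for c in candidates}
    return min(scores, key=scores.get), scores


train, validation = holdout_split(range(10), validation_fraction=0.2, seed=1)
print(len(train), len(validation))
```

Note that this is a simple random sample, so it inherits the pitfalls for time-ordered or geographic data discussed in the next section.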

Possible Sampling Pitfalls and Solutions

The validation sample may contain outliers, even extreme ones. This can obscure the best selection of a model.  One strategy that may help with this is cross-validation, although this can be time consuming when large samples are needed.

The usual procedure is to take a simple random sample of the available observations. Depending on the use case, this can be problematic: data collected over time tends to be similar at similar time periods, and the hold-out sample is now invisibly correlated with the training sample. Overfitting can then be more difficult to detect. Geographic data has a similar challenge.

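In Keras, we can evaluate a validation set during training either by passing explicit validation data or by holding out a fraction of the training data. Note that when Keras holds out a fraction, it takes the last samples of the data before shuffling rather than a random sample. The validation loss is computed at the end of each epoch. It is not used to adjust the model weights (only the training loss drives backpropagation), but monitoring it helps us find the best model and decide when to stop training.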

For time series, it is advantageous to use an “out-of-time” validation sample. Here’s a Spotfire interface that makes this simple, from our Autoencoder template where we pass in explicit validation data:


Using Spotfire's brushlinking capabilities, we can interactively mark a time period and assign it to a sample, e.g. whether this time period is for training, testing or validating the model.
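Outside of the Spotfire interface, the same out-of-time assignment can be sketched in a few lines of Python. This is only an illustration: the function name `out_of_time_split`, the cut-off dates, and the `(timestamp, values)` record layout are assumptions, not the template's actual implementation.

```python
from datetime import date


def out_of_time_split(records, train_end, validation_end):
    """Assign each (timestamp, values) record to a sample by its timestamp.

    Records up to and including train_end go to training, later records up to
    and including validation_end go to validation, and the rest go to test.
    """
    samples = {"train": [], "validation": [], "test": []}
    for timestamp, values in records:
        if timestamp <= train_end:
            samples["train"].append((timestamp, values))
        elif timestamp <= validation_end:
            samples["validation"].append((timestamp, values))
        else:
            samples["test"].append((timestamp, values))
    return samples


# Ten daily snapshots with a single made-up measurement each.
records = [(date(2021, 1, day), [float(day)]) for day in range(1, 11)]
samples = out_of_time_split(records, date(2021, 1, 6), date(2021, 1, 8))
print({name: len(rows) for name, rows in samples.items()})
```

Because the validation rows all come after the training rows, the validation error is a fairer estimate of how the model will behave on data from a later period.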

Other methods to help with this include:

  • Detrending - by calculating differences in predictors by time.
  • Striping - most easily explained with a visual example:

From above, we can see that we have selected multiple training periods which are separated by test periods. This method can help prevent overtraining and make our model more generalized and therefore better able to predict new data.
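The detrending and striping ideas above can be sketched as follows. Both helpers (`difference` and `striped_labels`) are hypothetical names for illustration, and the stripe length and label pattern are assumptions to be tuned to the use case.

```python
def difference(series, lag=1):
    """Detrend by replacing each value with its change over `lag` time steps."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    return [series[i] - series[i - lag] for i in range(lag, len(series))]


def striped_labels(n_rows, stripe_length, pattern=("train", "train", "test")):
    """Label time-ordered rows in contiguous stripes that cycle through pattern."""
    if stripe_length < 1:
        raise ValueError("stripe_length must be at least 1")
    return [pattern[(i // stripe_length) % len(pattern)] for i in range(n_rows)]


print(difference([1, 3, 6, 10]))
print(striped_labels(6, 2, pattern=("train", "test")))
```

Each stripe is a contiguous block of time, so training periods are separated by test periods rather than interleaved row by row.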

  • Regularization

This refers to methods that tend to reduce overfitting. As mentioned earlier, these often depend on hyperparameters that we can tune to get the smallest possible validation error (remembering that these are stochastic estimates and we need to take that into account as well). Chief among these is early stopping: both ensembles and neural networks are trained in sequence, adding trees or modifying weights as we train.  This is conveniently visualized using Learning Curves such as the following:

The training error, shown in blue above, decreases with more training epochs. The validation error also decreases but is greater than the training error, which we expect because this is data the model has not seen before. With additional training, the validation error will often start increasing: this indicates overfitting, as we can see in this next visualization:

Both losses are already quite low, so in Spotfire we can use the Y-axis slider to zoom in on the curves and see the errors in more detail. Notice that after ~400 epochs, the validation error increases while the training error continues to decrease, on average. The vertical lines mark the epochs of the minimum training and validation losses. Depending on which loss we monitor, we end up with two quite different models.

The learning curve thus provides a useful summary of the training process and where it may go awry. In effect, it is similar to a grid search in a single dimension, telling a compelling story in a visual way. It shows the utility of early stopping - once the validation error is clearly trending higher, further training is not likely to be useful. 
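To make the early stopping rule concrete, here is a minimal sketch that scans a recorded validation loss curve. The function name, the `patience` and `min_delta` parameters, and the example losses are illustrative assumptions. Keras offers an `EarlyStopping` callback with similar `monitor` and `patience` arguments, but this sketch does not reproduce its exact behavior.

```python
def early_stopping_epoch(validation_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 0-indexed.

    Training stops once the validation loss has failed to improve on its best
    value by more than min_delta for `patience` consecutive epochs.
    """
    if not validation_losses:
        raise ValueError("validation_losses must not be empty")
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, loss in enumerate(validation_losses):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Patience never ran out: training used every recorded epoch.
    return best_epoch, len(validation_losses) - 1


losses = [1.0, 0.8, 0.6, 0.5, 0.55, 0.6, 0.7, 0.9]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=3)
print(best_epoch, stop_epoch)
```

The model we keep is the one from `best_epoch`, not the one from `stop_epoch`; the extra epochs only confirm that the validation error is trending higher.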

For autoencoders, we have another way to visualize our fit: a histogram of the reconstruction errors. A useful autoencoder will have a distinct, low-frequency right tail of outliers that are not well reconstructed by our model. When we select the model attained at the minimum training loss (that is, a model that has overfitted to the training data, which is not recommended), we have a more skewed distribution of reconstruction errors with higher outlier errors:

When we select the model attained at the minimum validation loss, we observe:

When configuring the bottleneck layer for our autoencoder, this type of plot is helpful as well. 
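The reconstruction errors behind such a histogram can be computed as in the sketch below. It assumes mean squared error per row and a quantile-based cut-off for the right tail; both choices, and all function names, are illustrative assumptions rather than the template's actual rule.

```python
import math


def reconstruction_errors(inputs, reconstructions):
    """Mean squared reconstruction error for each row."""
    errors = []
    for x_row, r_row in zip(inputs, reconstructions):
        squared = [(x - r) ** 2 for x, r in zip(x_row, r_row)]
        errors.append(sum(squared) / len(squared))
    return errors


def quantile_threshold(errors, quantile=0.99):
    """Error value at the given quantile, using the nearest-rank method."""
    ordered = sorted(errors)
    rank = max(1, math.ceil(quantile * len(ordered)))
    return ordered[rank - 1]


def flag_anomalies(errors, threshold):
    """True for every row whose error lies strictly above the threshold."""
    return [e > threshold for e in errors]


inputs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reconstructions = [[1.0, 2.0], [3.0, 5.0], [1.0, 6.0]]
errors = reconstruction_errors(inputs, reconstructions)
threshold = quantile_threshold(errors, quantile=0.5)
print(errors, threshold, flag_anomalies(errors, threshold))
```

In practice the threshold would be set from the errors on training or validation data and then applied to new data, so that the definition of "unusual" does not shift with each new batch.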

Other Useful Methods for Regularization

  • Dropout 

    • This method for neural networks consists of omitting a random subset of neurons during each training step. This discourages the network from relying on any single neuron, at the cost of slowing the overall training process.

  • L1 and L2 penalties

    • Modify the loss function to favor models with smaller coefficients. 

    • Weight decay is a closely related variant, often used with the Adam optimizer (as in AdamW).

  • Smaller, simpler networks
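As a rough illustration of the first two methods, the sketch below adds L1 and L2 penalties to a loss value and applies inverted dropout to a list of activations. It uses plain Python lists instead of tensors, and the function names and penalty strengths are assumptions for illustration only.

```python
import random


def l1_penalty(weights, strength):
    """Penalty proportional to the sum of absolute weights."""
    return strength * sum(abs(w) for w in weights)


def l2_penalty(weights, strength):
    """Penalty proportional to the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def penalized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """Data loss plus the penalties, so smaller weights are favored."""
    return data_loss + l1_penalty(weights, l1) + l2_penalty(weights, l2)


def dropout(activations, rate, rng):
    """Inverted dropout: zero each activation with probability `rate` and
    rescale the survivors by 1 / (1 - rate) to keep the expected value."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [a / keep if rng.random() >= rate else 0.0 for a in activations]


print(penalized_loss(1.0, [1.0, -2.0], l1=0.1, l2=0.01))
print(dropout([1.0, 1.0, 1.0, 1.0], rate=0.5, rng=random.Random(0)))
```

Dropout is applied only during training; at prediction time all neurons are kept.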

Additional Notes for Autoencoders in Particular

An autoencoder is a particular type of neural network that attempts to reproduce its multidimensional input as output, but does so through a bottleneck layer that reduces the dimensionality of at least part of the model. As a result, the model fails to reproduce some of its inputs, and this provides a method of recognizing unusual or anomalous cases.

When we use autoencoders for anomaly detection, we may not want to minimize the validation error as much as we can. Sometimes we would like to see the outliers from a simpler model that doesn’t minimize our validation error but does provide a good balance between false outliers and outliers we can potentially have in our model.

Advanced Issues: Double Descent

Here’s a link that you may find interesting, concerning edge cases where continued training can overcome overfitting:

4 – The Overfitting Iceberg – Machine Learning Blog | ML@CMU | Carnegie Mellon University

Autoencoder Template Implementation

The Spotfire autoencoder template takes a specific approach to using autoencoders for anomaly detection.

  • Oriented towards the analysis of manufacturing processes over time, it models cross-sectional slices of data as snapshots of the values over time. 
  • The time dimension is addressed via post-processing of the reconstruction errors.
  • Reconstruction errors are decomposed to a vector of per-variable values, whose time series can provide added insight.
  • Anomalies that persist over time are highlighted as incidents of interest. This is done using a variant of the Western Electric Rules widely used in SPC.
  • With sufficient data, it is possible to cluster these incidents into similar groups whose occurrence can be analyzed retrospectively or, in conjunction with streaming tools like TIBCO StreamBase, monitored in real time.
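Two of the steps above can be sketched in plain Python: decomposing a snapshot's reconstruction error into per-variable values, and highlighting anomalies that persist over time. The persistence rule here is a deliberately simplified run rule (a fixed number of consecutive errors above a threshold). It is an assumption for illustration, not the template's actual variant of the Western Electric Rules, and the function names are hypothetical.

```python
def per_variable_errors(x_row, r_row):
    """Squared reconstruction error for each variable in one snapshot."""
    return [(x - r) ** 2 for x, r in zip(x_row, r_row)]


def persistent_incidents(errors, threshold, min_run=3):
    """Return inclusive (start, end) index pairs for every run of at least
    min_run consecutive errors above the threshold."""
    incidents = []
    run_start = None
    for i, error in enumerate(errors):
        if error > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                incidents.append((run_start, i - 1))
            run_start = None
    # Close a run that is still open at the end of the series.
    if run_start is not None and len(errors) - run_start >= min_run:
        incidents.append((run_start, len(errors) - 1))
    return incidents


errors_over_time = [0.1, 0.9, 0.2, 0.8, 0.9, 1.0, 0.1, 0.7, 0.9]
print(per_variable_errors([1.0, 2.0], [1.0, 4.0]))
print(persistent_incidents(errors_over_time, threshold=0.5, min_run=3))
```

The isolated spike at index 1 is ignored, while the three consecutive high errors form an incident. Applying the same rule to each per-variable error series indicates which variables drive an incident.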

For general information on the Spotfire Anomaly Detection Template

Authors

David Katz is a Principal Consultant at TIBCO. With a long career in data analysis, model building and statistical consulting, David enjoys tackling challenging problems with real-world benefits, in particular using advanced regression methods and making the invisible visible. The most fun is the variety of applications he has been able to work with, from Formula One racing to marketing and operations. In his spare time he likes to bike, hike and do yoga.