Intro to Random Cut Forest Algorithm

5 minute read

What is an Anomaly?

Anomalies are the data points that have something unusual about them. Defining unusual however is not always easy. For example, the image shown below contains an anomaly that is pretty easy to spot. All the characters in the image are capital S’s with the exception of the single number 5.

But what about the image shown in the right? The anomaly is less easy to spot. There are actually two anomalies in this dataset. The first is easy to spot as there is a number 5, but the less obvious one is that only characters that appear in pairs are vowels. This kind of anomaly is almost impossible for a human to identify, but, given enough data, a Machine Learning algorithm could find it.

One of the challenges you might have noticed in identifying anomalies in the above figure, is that you did not know what type of anomaly you were looking for. If you had been told that the anomaly had to do with numbers versus letters, you would have easily identified the 5 and two adjacent X X (or non-vowels). We will not define any rules to help the model determine what is an anomaly. Instead the model figures out for itself, which is why detecting anomalies is an unsupervised learning problem.

A simple way to understand how Random Cut Forest works

Let’s take a high-level overview of how the algorithm actually works. RCF is an unsupervised learning algorithm for detecting anomalies in numeric high dimensional data streams or data points.

For simplicity, lets assume that we have a few data points in a two-dimensional space. We can clearly see the outlier data points in orange.

The algorithm associates an anomaly score with each data point in the dataset. This score is based on how far it is from the normal cluster of data points.

  • Low score values indicate that the data point is normal.
  • High score values indicates the presence of an anomaly in the data.

The definition of low/high depends on the application itself, but common practices suggests that score beyond 3 standard deviations from the mean score are considered anomalous.

How Random Cut Forest calculates the score?

The main idea behind the RCF algorithm is to create a forest of trees. For example, let’s assume that we have two-dimensional data points in a space with a majority of data lies in clusters except for one anomalous data point.

  1. The first step is to compute a bounding box of data, and the way we do that is by trying to find the min and max values of each dimension.
  2. Second, we select one of the dimensions and we do a cut randomly anywhere at range through that dimension. In our example here, we did a random cut through the X dimension.
  3. Then we compute another bounding box for the data points on the right hand side, and the same for data points on the left hand side. We do the cut again for the bounding boxes. If there is a point that’s an outlier, then there is a very high probability that this point will be closer to the root of the tree. And the reason for that is because it will be cut and become isolated very early in the process.
  4. We keep doing this, until every point in the tree is completely isolated.

The closer the point to the root, the higher the score it gets.

Amazon SageMaker Random Cut Forest (RCF) Algorithm

Amazon SageMaker Random Cut Forest (RCF) is an unsupervised algorithm for detecting anomalous data points within a data set. These are observations which diverge from otherwise well-structured or patterned data.

Anomalies can manifest as unexpected spikes in time series data, breaks in periodicity, or unclassifiable data points.

Figure shows anomalous data points

They are easy to describe in that, when viewed in a plot, they are often easily distinguishable from the “regular” data. Including these anomalies in a data set can drastically increase the complexity of a machine learning task since the “regular” data can often be described with a simple model.

With each data point, RCF associates an anomaly score. Low score values indicate that the data point is considered “normal.” High values indicate the presence of an anomaly in the data.

Example dataframe showing the anomaly score associated with each row of a multivariate time series.

The definitions of “low” and “high” depend on the application but common practice suggests that scores beyond three standard deviations from the mean score are considered anomalous.

While there are many applications of anomaly detection algorithms to one-dimensional time series data such as traffic volume analysis or sound volume spike detection, RCF is designed to work with arbitrary-dimensional input. Amazon SageMaker RCF scales well with respect to number of features, data set size, and number of instances. As shown in the above example, we are using six time series together to detect anomalies.

Why is it called a Random Cut Forest?: Random Cut Forest is a very descriptive name because the algorithm takes a bunch of random data points (Random), cuts them to the same number of points and creates trees (Cut). It then looks at all of the trees together (Forest) to determine whether a particular data point is an anomaly: Random Cut Forest.

A tree is an ordered way of storing numerical data. The simplest type of tree is called a binary tree. It’s a great way to store data because it’s easy and fast for a computer to use. To create a tree, you randomly subdivide the data points until you isolate the point you’re testing to determine whether it’s an anomaly. Each time you subdivide the data points, it creates a new level of the tree.

The fewer times you need to subdivide the data points before you isolate the target data point the more likely it is that the data point is an anomaly for that sample of data.

The final step performed by the Random Cut Forest algorithm is to combine the trees into a forest. If lots of the samples have small trees then the target data point is likely to be an anomaly. If only a few of the samples have small trees then it’s unlikely to be an anomaly.