Techniques for detecting stationarity in Time Series.

9 minute read

Time series models

The goal of this post is to learn how to predict the future values of a time series. Time series data is everywhere: it is used in a wide variety of fields, and there are many datasets whose future we would like to predict. Knowing future obesity rates could help us intervene now for public health; predicting consumer energy demand could help power stations run more efficiently; and predicting how a city's population will change could help us build the infrastructure it will need.

We can forecast all of these datasets using time series models, and ARIMA models are one of the go-to time series tools. BUT, before we can dive into modeling, we need to make the time series stationary.

To model a time series, it must be stationary!

Stationary means that the distribution of the data doesn’t change with time. For a time series to be stationary it must fulfill three criteria:

  • The series has zero trend: it isn’t growing or shrinking over time.
  • The variance is constant: the average distance of the data points from the mean line isn’t changing.
  • The autocorrelation is constant: how each value in the time series relates to its neighbors stays the same.

How to split a time series using train-test-split?

Generally, in machine learning, you have a training set, which you fit your model on, and a test set, which you test your predictions against. Time series forecasting is the same, but our train-test split has to be different: because we use past values to make future predictions, we need to split the data in time. We train on the data earlier in the time series and test on the data that comes later. We can split a time series at a given date using pandas slicing.

# Import modules
import pandas as pd
import matplotlib.pyplot as plt

# Load in the time series
candy = pd.read_csv('data/candy_production.csv',
            index_col='date',
            parse_dates=True)

# Plot and show the time series on axis ax
fig, ax = plt.subplots(figsize=(10,5))
candy.plot(ax=ax)

(Figure: the candy production time series)

We take the candy production dataset and split it into a train and a test set. The reason to do this is so that you can test the quality of your model fit when you are done. Split the time series into train and test sets by slicing with datetime indexes. Take the train set as everything up to the end of 2006 and the test set as everything from the start of 2007.

# Split the data into a train and test set
candy_train = candy.loc[:'2006']
candy_test = candy.loc['2007':]

# Create an axis
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the train and test sets on the axis ax
candy_train.plot(ax=ax)
candy_test.plot(ax=ax)
plt.show()

(Figure: the candy production series split into train and test sets)

Great! Take a look at the plot: do you think you could predict what happens after 2006 given the blue training set? What happens to the long-term trend and the seasonal pattern?

Is it stationary?

Identifying whether a time series is stationary or non-stationary is very important. If it is stationary, you can use ARMA models to predict the next values of the time series. If it is non-stationary then you cannot use ARMA models directly; however, as you will see in this post, you can often transform a non-stationary time series into a stationary one.

A naive/subjective way of checking for stationarity is by looking at the plot. For example, take a look at these 3 time series:

  • The top plot has a trend, so it is not stationary.
  • The middle plot’s variance changes with time, so it is not stationary.
  • The bottom plot shows no trend, no obvious change in variance, and no change in dynamics, so it appears to be stationary.
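Since the three plots are not reproduced here, a quick simulated sketch (my own example, not the original figure) shows what each case looks like:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
t = np.arange(200)

trending = 0.05 * t + rng.normal(size=200)      # has a trend: not stationary
changing_var = rng.normal(scale=1 + t / 50)     # growing variance: not stationary
stationary = rng.normal(size=200)               # white noise: stationary

fig, axes = plt.subplots(3, 1, figsize=(10, 6), sharex=True)
for ax, series, title in zip(axes, (trending, changing_var, stationary),
                             ('Trend', 'Changing variance', 'Stationary')):
    ax.plot(t, series)
    ax.set_title(title)
plt.tight_layout()
plt.show()
```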

Statistical test to detect stationarity

Now that we have seen the ways in which a time series can be non-stationary, and how to spot them by plotting, let’s look at a more formal way of accomplishing this task: statistical tests.

The most common test for identifying whether a time series is non-stationary is the augmented Dickey-Fuller (ADF) test.

The ADF test is a statistical test where the NULL HYPOTHESIS is that your time series is NON-STATIONARY due to trend.

As an example, we will run the augmented Dickey-Fuller test on the earthquakes time series to test for stationarity. The plot of the time series is shown below. It looks like it could be stationary, but earthquakes are very damaging, so if you want to make predictions about them you had better be sure.

earthquake = pd.read_csv('data/earthquakes.csv', index_col='date', parse_dates=True)

# Plot and show the time series on axis ax
fig, ax = plt.subplots(figsize=(10,5))
ax = earthquake['earthquakes_per_year'].plot(ax=ax)

(Figure: the earthquakes per year time series)

# Import the augmented Dickey-Fuller test function
from statsmodels.tsa.stattools import adfuller

# Run test
result = adfuller(earthquake['earthquakes_per_year'])

# Print test statistic
print(result[0])

# Print p-value
print(result[1])

# Print critical values
print(result[4])
-3.1831922511917816
0.020978425256003668
{'1%': -3.5003788874873405, '5%': -2.8921519665075235, '10%': -2.5830997960069446}

The results object is a tuple. The zeroth element is the test statistic; in this case it is -3.18. The more negative this number is, the more likely it is that the data is stationary. The next item in the results tuple is the test p-value; here it’s 0.02. If the p-value is smaller than 0.05, we reject the null hypothesis and conclude that our time series is stationary. The dictionary printed last (result[4]) stores the critical values of the test statistic which equate to different p-values. In this case, if we want a p-value of 0.05 or below, our test statistic needs to be below -2.89.

In this case, you can reject the null hypothesis that the time series is non-stationary. Therefore it is stationary. You probably could have intuited this from looking at the graph or by knowing a little about geology. The time series covers only about 100 years which is a very short time on a geological time scale.

Remember that it is always worth plotting your time series as well as doing the statistical tests. These tests are very useful but sometimes they don’t capture the full picture.

Remember also that the Dickey-Fuller test only tests for trend stationarity. For example, consider the plot shown below: although the time series’ behavior clearly changes partway through, and it is therefore non-stationary, it passes the Dickey-Fuller test.

Making a time series stationary

There are also ways to transform non-stationary time series into stationary ones. So let’s say we have a time series that is non-stationary. We need to transform the data into a stationary form before we can model it.

You can think of this a bit like feature engineering in classic machine learning.

Let’s start with a non-stationary dataset. Here is an example of the simulated population of a city, which has a clear trend. If you could predict the growth rate of a city, it would be possible to plan and build the infrastructure the city will need later, thus future-proofing public spending. In this case the time series is fictitious, but it’s perfect to practice on.

We will test for stationarity by eye and use the Augmented Dicky-Fuller test, and take the difference to make the dataset stationary.

import numpy as np
# Generate a series of 120 year-end dates
year = pd.Series(pd.date_range(start='1900', end='2020', freq='Y'))
# Generate 120 evenly spaced values rising from 15 towards 115
population = pd.Series(np.arange(15, 115, 100/120))
# Generate 120 random steps with mean=2 and standard deviation=10
noise = np.random.normal(loc=2, scale=10, size=120)
# Add some noise to the population, otherwise it would be a straight trend line
population = population + noise
# Create a dataframe from both series and use the dates as the index
city = pd.concat([year, population], axis=1)
city.columns = ['year', 'city_population']
city = city.set_index('year')

# Run the ADF test on the time series
result = adfuller(city['city_population'])

# Plot the time series
fig, ax = plt.subplots(figsize=(10,5))
city['city_population'].plot(ax=ax)
plt.show()

# Print the test statistic and the p-value
print('ADF Statistic:', result[0])
print('p-value:', result[1])

(Figure: the simulated city population time series, trending upwards)

ADF Statistic: -0.04836301708265112
p-value: 0.9543450186278956

Recall: the ADF test is a statistical test where the NULL HYPOTHESIS is that your time series is NON-STATIONARY due to trend.

The p-value of 0.95 is greater than 0.05, so we fail to reject the null hypothesis. Therefore, this time series is non-stationary with respect to trend, which is the kind of non-stationarity the ADF test detects.

Differencing

One very common way to make a time series stationary is to take its difference: from each value in the time series we subtract the previous value.

We can do this using the .diff() method of a pandas DataFrame. Note that this leaves one NaN value at the start, since the first value has no previous value to subtract from it. We can get rid of it using the .dropna() method.
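A minimal sketch of what .diff() and .dropna() do on a toy series:

```python
import pandas as pd

s = pd.Series([10, 12, 15, 14])
print(s.diff().tolist())           # [nan, 2.0, 3.0, -1.0]
print(s.diff().dropna().tolist())  # [2.0, 3.0, -1.0]
```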

Here is the time series after differencing. In this case, taking the difference once was enough to make it stationary, but for other time series we may need to take the difference more than once.

# Calculate the first difference of the time series
city_stationary = city.diff().dropna()

# Run ADF test on the differenced time series
result = adfuller(city_stationary['city_population'])

# Plot the differenced time series
fig, ax = plt.subplots(figsize=(10,5))
city_stationary['city_population'].plot(ax=ax)
plt.show()

# Print the test statistic and the p-value
print('ADF Statistic:', result[0])
print('p-value:', result[1])

(Figure: the first difference of the city population series)

ADF Statistic: -8.919811618590764
p-value: 1.0485124656781588e-14

The p-value is far below 0.05, so this time series is now stationary and ready for modeling!

Sometimes a single difference is not enough and you need to take the difference a second time. That is not necessary in this case, but for illustration purposes, here it is.

# Calculate the second difference of the time series
city_stationary = city.diff().diff().dropna()

# Run ADF test on the differenced time series
result = adfuller(city_stationary['city_population'])

# Plot the differenced time series
fig, ax = plt.subplots(figsize=(10, 5))
city_stationary['city_population'].plot(ax=ax)
plt.show()

# Print the test statistic and the p-value
print('ADF Statistic:', result[0])
print('p-value:', result[1])

(Figure: the second difference of the city population series)

ADF Statistic: -6.396184261250273
p-value: 2.0481255683133966e-08

Other transformations to make time series stationary

Differencing should be the first transform you try to make a time series stationary, but sometimes it isn’t the best option. In those cases we need other transformations, such as taking the log or the square root of the time series, or calculating the proportional change. It can be hard to decide which of these to use, but often the simplest solution is the best one.
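As a sketch of these alternatives on a toy series (my own example), each transformation is a one-liner in pandas/numpy:

```python
import numpy as np
import pandas as pd

# A hypothetical series whose level and variance both grow over time
series = pd.Series(np.arange(1.0, 11.0) ** 2)  # 1, 4, 9, ..., 100

sqrt_series = np.sqrt(series)               # square root: 1, 2, 3, ..., 10
log_series = np.log(series)                 # log compresses large values
prop_change = series.pct_change().dropna()  # proportional change between steps
print(sqrt_series.tolist())                 # [1.0, 2.0, ..., 10.0]
```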

A classic way of transforming stock time series is the log-return of the series. This is calculated as follows:

log_return(t) = log( y(t) / y(t-1) )

In the pandas code below, y(t) is amazon, y(t-1) is amazon.shift(1), and log is np.log.

amazon = pd.read_csv('data/amazon_close.csv', index_col='date', parse_dates=True)

# Calculate the first difference and drop the nans
amazon_diff = amazon.diff()
amazon_diff = amazon_diff.dropna()

# Run test and print
result_diff = adfuller(amazon_diff['close'])
print(result_diff)

# Calculate log-return and drop nans
amazon_log = np.log(amazon/amazon.shift(1))
amazon_log = amazon_log.dropna()

# Run test and print
result_log = adfuller(amazon_log['close'])
print(result_log)
(-7.2035794888112035, 2.3312717254877214e-10, 23, 1234, {'1%': -3.435660336370594, '5%': -2.863885022214541, '10%': -2.568018522153254}, 10764.626718933836)
(-34.915748536059674, 0.0, 0, 1257, {'1%': -3.4355629707955395, '5%': -2.863842063387667, '10%': -2.567995644141416}, -6245.723147672197)

Notice that both the differenced and the log-return transformed time series have a small p-value, but the log-return series has a much more negative test statistic (about -34.9). This suggests the log-return transformation is the better choice for modeling.

Summary

In summary, this post showed how to test for stationarity and make time series stationary.

In the upcoming posts, we will see:

  1. How to fit ARIMA models?
  2. How to optimize ARIMA models?
  3. How to make forecasts of important real-world data? And, importantly,
  4. How to find the limits (uncertainty) of your forecasts?