Time Series Window Functions: Rolling and Expanding Metrics

9 minute read

Window functions for time series in pandas

Window functions are useful because they allow you to operate on sub-periods of your time series. In particular, window functions calculate metrics for the data inside the window. The result of this calculation then forms a new time series, where each data point represents a summary of several data points of the original time series.

In this post, we will learn about window functions for time series in pandas. We will discuss two main types of windows:

  • Rolling windows maintain the same size while they slide over the time series, so each new data point is the result of a given number of observations.
  • Expanding windows grow with the time series, so that the calculation that produces a new data point is the result of all previous data points.
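
To make the difference concrete, here is a minimal sketch on a made-up five-point series (the series s is invented for illustration, not part of the ozone data used below):

import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# Rolling: each value summarizes the last 3 observations (NaN until the window is full)
print(s.rolling(3).mean())    # NaN, NaN, 4.0, 6.0, 8.0

# Expanding: each value summarizes all observations seen so far
print(s.expanding().mean())   # 2.0, 3.0, 4.0, 5.0, 6.0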

Rolling Windows

Let’s start with air quality trends for New York City since 2010. In particular, we’ll use the daily ozone concentration levels provided by the Environmental Protection Agency to calculate and plot the 90-day and 360-day rolling averages.

import pandas as pd
import matplotlib.pyplot as plt

# Import and inspect ozone data here
data = pd.read_csv('data/air_quality_data/ozone_nyc.csv', index_col='date', parse_dates=True)
print(data.info())
data.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.3 KB
None
               Ozone
date
2000-01-01  0.004032
2000-01-02  0.009486
2000-01-03  0.005580
2000-01-04  0.008717
2000-01-05  0.013754

Integer window size

When you choose an integer-based window size, pandas will only calculate the mean if the window contains no missing values: by default, min_periods equals the window size. For example, rolling(3) uses the current observation as well as the two preceding ones to calculate the desired metric (.mean()). So, in our case, the first two values will be NaN, since fewer than three observations are available there. For 2000-01-03 the window is complete, and we get 0.006366:

import numpy as np
np.mean([0.004032, 0.009486, 0.005580])
0.006365999999999999
data['Ozone'].rolling(3).mean()
date
2000-01-01         NaN
2000-01-02         NaN
2000-01-03    0.006366
2000-01-04    0.007928
2000-01-05    0.009350
                ...   
2017-03-27    0.020319
2017-03-28    0.016974
2017-03-29    0.017950
2017-03-30    0.024757
2017-03-31    0.032128
Name: Ozone, Length: 6291, dtype: float64

Note that you can change this default by setting the min_periods parameter to a value smaller than the window size of 3. With min_periods=2, for example, pandas will calculate the mean for a data point even if one value in the window is missing.

data['Ozone'].rolling(3, min_periods=2).mean()
date
2000-01-01         NaN
2000-01-02    0.006759
2000-01-03    0.006366
2000-01-04    0.007928
2000-01-05    0.009350
                ...   
2017-03-27    0.020319
2017-03-28    0.016974
2017-03-29    0.017950
2017-03-30    0.024757
2017-03-31    0.032128
Name: Ozone, Length: 6291, dtype: float64

String window size

We can also create windows based on a date offset. If you choose '30D', for instance, the window will contain all days on which ozone readings were not NaN during the last 30 calendar days. While the window is fixed in terms of period length, the number of observations will vary.

In the example below, we use rolling('3D'), so the window covers the last three calendar days: the current observation plus up to two preceding ones. If any values inside the window are NaN, pandas simply averages over the non-missing ones. On a daily index like this one, an integer-based window with min_periods=1 behaves almost identically, as the quick comparison after the output shows.

data['Ozone'].rolling('3D').mean()
date
2000-01-01    0.004032
2000-01-02    0.006759
2000-01-03    0.006366
2000-01-04    0.007928
2000-01-05    0.009350
                ...   
2017-03-27    0.020319
2017-03-28    0.016974
2017-03-29    0.017950
2017-03-30    0.024757
2017-03-31    0.032128
Name: Ozone, Length: 6291, dtype: float64
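
A quick side-by-side comparison (just a sketch; the variable names int_based and offset_based are invented for illustration). Keep in mind that the two versions agree only where the index has one row per calendar day, and this series does have a few missing days:

# Compare the integer window (3 rows, min_periods=1) with the '3D' offset window
int_based = data['Ozone'].rolling(3, min_periods=1).mean()
offset_based = data['Ozone'].rolling('3D').mean()
pd.concat([int_based, offset_based], axis=1, keys=['3 rows', '3 days']).head()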

Alright, now that we understand the subtle difference between integer-based and string-based windows, let’s get back to plotting the rolling means.

Rolling average air quality since 2010 for New York City

# Calculate 90-day and 360-day rolling means of the ozone series
data['90D'] = data['Ozone'].rolling(window='90D').mean()
data['360D'] = data['Ozone'].rolling(window='360D').mean()
data.head()
               Ozone       90D      360D
date
2000-01-01  0.004032  0.004032  0.004032
2000-01-02  0.009486  0.006759  0.006759
2000-01-03  0.005580  0.006366  0.006366
2000-01-04  0.008717  0.006954  0.006954
2000-01-05  0.013754  0.008314  0.008314
# Plot data
data['2010':].plot(figsize=(10, 10))
plt.title('New York City')
plt.show()

[Plot: NYC ozone since 2010 with 90-day and 360-day rolling means]

Do the different rolling windows help you see any long term trends that are hard to spot in the original data?

Rolling 360-day mean & standard deviation for NYC ozone data since 2000

The daily data are very volatile, so using a longer-term rolling average can help reveal a longer-term trend.

# Import and inspect ozone data here
data = pd.read_csv('data/air_quality_data/ozone_nyc.csv', parse_dates=['date'], index_col='date').dropna()

# Calculate the rolling mean and std here
rolling_stats = data.Ozone.rolling(360).agg(['mean', 'std'])

# Join rolling_stats with ozone data
stats = data.join(rolling_stats)

# Plot stats
stats.plot(subplots=True, figsize=(10, 10))
plt.show()

[Plot: ozone, 360-day rolling mean, and 360-day rolling standard deviation as subplots]

stats.head()
               Ozone  mean  std
date
2000-01-01  0.004032   NaN  NaN
2000-01-02  0.009486   NaN  NaN
2000-01-03  0.005580   NaN  NaN
2000-01-04  0.008717   NaN  NaN
2000-01-05  0.013754   NaN  NaN

Rolling quantiles for daily air quality in NYC

Next, we calculate rolling quantiles to describe changes in the dispersion of a time series over time, in a way that is less sensitive to outliers than the mean and standard deviation.

Let’s calculate rolling quantiles - at 10%, 50% (median) and 90% - of the distribution of daily average ozone concentration in NYC using a 360-day rolling window.

# Import and inspect ozone data here
data = pd.read_csv('data/air_quality_data/ozone_nyc.csv', parse_dates=['date'], index_col='date')
print(data.info())
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.3 KB
None

As we have some missing values, instead of using dropna() as we did above, we can use resample('D').interpolate() to fill in the values.

# Resample, interpolate and inspect ozone data here
data = data.resample('D').interpolate()
print(data.info())
data.head()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31
Freq: D
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   Ozone   6300 non-null   float64
dtypes: float64(1)
memory usage: 98.4 KB
None
               Ozone
date
2000-01-01  0.004032
2000-01-02  0.009486
2000-01-03  0.005580
2000-01-04  0.008717
2000-01-05  0.013754
# Create the rolling window
rolling = data['Ozone'].rolling(360)

# Add the rolling quantiles to the ozone data
data['q10'] = rolling.quantile(0.1)
data['q50'] = rolling.quantile(0.5)
data['q90'] = rolling.quantile(0.9)

# Plot the data
data.plot(figsize=(10, 10))
plt.show()

[Plot: ozone with rolling 360-day 10%, 50% and 90% quantiles]

The rolling quantiles help show the volatility of the series.

Expanding Windows

Now we move on from rolling to expanding windows: we calculate metrics over windows that grow to include all data up to the current date. Each data point of the resulting time series reflects all historical values up to that point.

Expanding windows are useful for calculating, for instance, a cumulative rate of return, or a running maximum or minimum. For example: what is my cumulative return on a $1,000 investment in Google?

Expanding windows in pandas

In pandas, we can use either the .expanding() method, which works just like .rolling(), or, in a few cases, shorthand methods for the cumulative sum, product, min and max (.cumsum(), .cumprod(), .cummin(), .cummax()).

To understand the basic idea, take a look at this simple example. We start with a list of the numbers 0 to 4. You can calculate the cumulative sum either with .expanding() followed by .sum(), or by applying .cumsum() directly. Either way, you get a series where each number is the sum of all values up to and including that point.

df = pd.DataFrame({'data': range(5)})
df['expanding_sum'] = df['data'].expanding().sum()
df['cumulative_sum'] = df['data'].cumsum()
df
   data  expanding_sum  cumulative_sum
0     0            0.0               0
1     1            1.0               1
2     2            3.0               3
3     3            6.0               6
4     4           10.0              10
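
The shorthand methods for the running min and max behave the same way. Here is a minimal sketch on the toy DataFrame above (the .astype(float) is only needed because the expanding methods always return floats); the cumulative product, .cumprod(), shows up in the return calculations later:

# .cummax()/.cummin() match their .expanding() counterparts
print(df['data'].expanding().max().equals(df['data'].cummax().astype(float)))  # True
print(df['data'].expanding().min().equals(df['data'].cummin().astype(float)))  # True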

Difference between cumsum() and diff()

Earlier we saw that expanding windows allow us to run cumulative calculations. The cumulative sum method is, in fact, the inverse of the .diff() method.

To illustrate this, let’s use the Google stock price time series, create the differences between prices, and reconstruct the series using the cumulative sum.

data = pd.read_csv('data/stock_data/google.csv', index_col='Date', parse_dates=True)
data = data.dropna()
print(data.shape)
data.head()
(756, 1)
             Close
Date
2014-01-02  556.00
2014-01-03  551.95
2014-01-06  558.10
2014-01-07  568.86
2014-01-08  570.04

First, let’s run the .diff() method. Notice the NaN in the first row: there is no preceding value to subtract, so no difference can be calculated there.

differences = data.diff()
differences.head()
            Close
Date
2014-01-02    NaN
2014-01-03  -4.05
2014-01-06   6.15
2014-01-07  10.76
2014-01-08   1.18
# Drop the first value (the NaN)
differences = data.diff().dropna()

# Select the start price (the first row)
start_price = data.iloc[[0]]

start_price
            Close
Date
2014-01-02  556.0
pd.concat([start_price, differences])
             Close
Date
2014-01-02  556.00
2014-01-03   -4.05
2014-01-06    6.15
2014-01-07   10.76
2014-01-08    1.18
...            ...
2016-12-23   -1.35
2016-12-27    1.64
2016-12-28   -6.50
2016-12-29   -2.26
2016-12-30  -10.97

756 rows × 1 columns

# Calculate cumulative sum
cumulative_sum = pd.concat([start_price, differences]).cumsum()

# Validate cumulative sum equals data
print(data.equals(cumulative_sum))
True

How to calculate a running return?

The simple period return is just the current price divided by the previous price, minus 1.

The return over several periods is the product of all period returns after adding 1, minus 1 at the end. Pandas makes these calculations easy: you have already seen the methods for percent change (.pct_change()) and basic math, and now we’ll use the cumulative product. To get the cumulative, or running, rate of return on the S&P 500, we follow the steps described above in pandas.
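
As a toy example of this arithmetic (made-up returns, not the S&P 500 data loaded below): returns of 10% and then 20% compound to 1.1 × 1.2 − 1 = 0.32, i.e. 32%, not 30%. The same result with the cumulative product:

r = pd.Series([0.10, 0.20])       # two made-up period returns: +10%, then +20%
print(r.add(1).cumprod().sub(1))  # 0.10, then 0.32: the compounded return so far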

data = pd.read_csv('data/stock_data/sp500.csv', index_col='date', parse_dates=True)
data.head()
              SP500
date
2007-06-29  1503.35
2007-07-02  1519.43
2007-07-03  1524.87
2007-07-05  1525.40
2007-07-06  1530.44
pr = data['SP500'].pct_change() # period return
pr_plus_one = pr.add(1) # or pr + 1
cumulative_return = pr_plus_one.cumprod().sub(1)
cumulative_return.mul(100).plot(figsize=(10, 10))
plt.title('SP500 running rate of return')
Text(0.5, 1.0, 'SP500 running rate of return')

[Plot: S&P 500 running rate of return, in percent]

Looks like the S&P 500 is up 60% since 2007, despite being down 60% in 2009.

Running min and max of a time series

We can also easily get the running min and max of a time series: just apply the .expanding() method followed by the respective aggregation method. The red and green lines outline the min and max up to the current date for each day.

data['running_min'] = data['SP500'].expanding().min()
data['running_max'] = data['SP500'].expanding().max()
data.plot(figsize=(10, 10))
<AxesSubplot:xlabel='date'>

[Plot: S&P 500 with running minimum and maximum]

Cumulative return on $1,000 invested in Google vs Apple

Now let’s compare how much $1,000 would be worth if invested in Google ('GOOG') or Apple ('AAPL') in 2010.

data = pd.read_csv('data/stock_data/apple_google.csv', index_col='Date', parse_dates=True)
data.head()
            AAPL    GOOG
Date
2010-01-04   NaN  313.06
2010-01-05   NaN  311.68
2010-01-06   NaN  303.83
2010-01-07   NaN  296.75
2010-01-08   NaN  300.71
# Define your investment
investment = 1000

# Calculate the daily returns here
returns = data.pct_change()

# Calculate the cumulative returns here
returns_plus_one = returns + 1
cumulative_return = returns_plus_one.cumprod()

# Calculate and plot the investment return here
cumulative_return.mul(investment).plot(figsize=(10, 10))
plt.show()

[Plot: value of $1,000 invested in AAPL and GOOG]

Cumulative return on $1,000 invested in Google vs Apple by 1-year periods

Apple outperformed Google over the entire period, but this may have been different over various 1-year sub-periods, so switching between the two stocks might have yielded an even better result.

To analyze this, we calculate the cumulative return for rolling 1-year periods, and then plot the returns to see when each stock was superior.

# Import numpy
import numpy as np

# Define a multi_period_return function
def multi_period_return(period_returns):
    return np.prod(period_returns + 1) - 1

# Calculate daily returns
daily_returns = data.pct_change()

# Calculate rolling_annual_returns
rolling_annual_returns = daily_returns.rolling('360D').apply(multi_period_return)

# Plot rolling_annual_returns
rolling_annual_returns.mul(100).plot(figsize=(10, 10));
plt.show()

[Plot: rolling 1-year returns for AAPL and GOOG, in percent]

Now this tells a different story. Do you think it’s better to invest in Google or Apple?