# Time Series Window Functions: Rolling and Expanding Metrics

## Window functions for time series in pandas

Window functions are useful because they allow you to operate on sub-periods of your time series. In particular, window functions calculate metrics for the data inside the window. The result of this calculation forms a new time series, where each data point represents a summary of several data points of the original time series.

In this post, we will learn about window functions for time series in pandas. We will discuss two main types of windows:

- `Rolling windows` maintain the same size while they slide over the time series, so each new data point is the result of a given number of observations.
- `Expanding windows` grow with the time series, so that the calculation that produces a new data point is the result of all previous data points.
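As a minimal sketch of the two window types before we touch real data (the numbers here are made up for illustration):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8, 10])

# Rolling window of size 2: each point summarizes the last 2 observations
rolling_mean = s.rolling(2).mean()     # NaN, 3.0, 5.0, 7.0, 9.0

# Expanding window: each point summarizes all observations so far
expanding_mean = s.expanding().mean()  # 2.0, 3.0, 4.0, 5.0, 6.0

print(rolling_mean.tolist())
print(expanding_mean.tolist())
```

Note how the rolling result needs a full window before producing its first value, while the expanding result starts immediately.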

## Rolling Windows

Let’s start with air quality trends for New York City since 2010. In particular, we’ll use the daily ozone concentration levels provided by the Environmental Protection Agency to calculate and plot the `90- and 360-day rolling averages`.

```
import pandas as pd
import matplotlib.pyplot as plt
# Import and inspect ozone data here
data = pd.read_csv('data/air_quality_data/ozone_nyc.csv', index_col='date', parse_dates=True)
print(data.info())
data.head()
```

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.3 KB
None
```

| date | Ozone |
|---|---|
| 2000-01-01 | 0.004032 |
| 2000-01-02 | 0.009486 |
| 2000-01-03 | 0.005580 |
| 2000-01-04 | 0.008717 |
| 2000-01-05 | 0.013754 |

### Integer window size

When you choose an integer-based window size, pandas will only calculate the mean once the window contains the required number of non-missing values (by default, `min_periods` equals the window size). For example, `rolling(3)` means that we use the current observation as well as the two preceding ones to calculate our desired metric (`.mean()`). So, in our case, the first two values will be `NaN`, since fewer than three observations are available at those points. For `2000-01-03`, the window is complete, so we see the value `0.006366`:

```
import numpy as np
np.mean([0.004032, 0.009486, 0.005580])
```

```
0.006365999999999999
```

```
data['Ozone'].rolling(3).mean()
```

```
date
2000-01-01 NaN
2000-01-02 NaN
2000-01-03 0.006366
2000-01-04 0.007928
2000-01-05 0.009350
...
2017-03-27 0.020319
2017-03-28 0.016974
2017-03-29 0.017950
2017-03-30 0.024757
2017-03-31 0.032128
Name: Ozone, Length: 6291, dtype: float64
```

Note, you can change this default by setting the `min_periods` parameter to a value smaller than the window size of 3. For example, with `min_periods=2`, pandas will calculate the mean for a window even if one of its three values is missing.

```
data['Ozone'].rolling(3, min_periods=2).mean()
```

```
date
2000-01-01 NaN
2000-01-02 0.006759
2000-01-03 0.006366
2000-01-04 0.007928
2000-01-05 0.009350
...
2017-03-27 0.020319
2017-03-28 0.016974
2017-03-29 0.017950
2017-03-30 0.024757
2017-03-31 0.032128
Name: Ozone, Length: 6291, dtype: float64
```

### String window size

We can also create windows based on a date offset. If you choose `'30D'`, for instance, the window will contain the non-missing observations from the last 30 calendar days. While the window is fixed in terms of period length, the number of observations it contains will vary. Let’s take a look at what the rolling mean looks like.

In the example below we use `rolling('3D')`, so pandas considers the observations from the last 3 calendar days (including the current one) and calculates the mean from whatever non-missing values fall in that range. Since offset-based windows default to `min_periods=1`, we get a result as soon as the window contains a single observation - similar to using `min_periods=1` with an integer-based window.

```
data['Ozone'].rolling('3D').mean()
```

```
date
2000-01-01 0.004032
2000-01-02 0.006759
2000-01-03 0.006366
2000-01-04 0.007928
2000-01-05 0.009350
...
2017-03-27 0.020319
2017-03-28 0.016974
2017-03-29 0.017950
2017-03-30 0.024757
2017-03-31 0.032128
Name: Ozone, Length: 6291, dtype: float64
```

Alright, now that we understand the subtle difference between `integer-based windows` and `offset-based windows`, let’s get back to plotting the rolling means.
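To make the difference concrete, here is a minimal sketch on a hypothetical series whose dates contain a gap (values and dates are made up for illustration):

```python
import pandas as pd

# Hypothetical daily readings with a two-day gap after 2000-01-03
idx = pd.to_datetime(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-06'])
s = pd.Series([1.0, 2.0, 3.0, 4.0], index=idx)

# Integer window: always the last 3 observations, regardless of their dates
int_mean = s.rolling(3).mean()        # last value: mean(2, 3, 4) = 3.0

# Offset window: only observations within the last 3 calendar days
offset_mean = s.rolling('3D').mean()  # last value: 4.0, the gap excludes the rest

print(int_mean)
print(offset_mean)
```

On 2000-01-06, the integer window still reaches back across the gap, while the `'3D'` window only covers the last three calendar days and therefore contains a single observation.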

### Rolling average air quality since 2010 for New York City

```
# Calculate 90-day and 360-day rolling means of the ozone concentration
data['90D'] = data['Ozone'].rolling(window='90D').mean()
data['360D'] = data['Ozone'].rolling(window='360D').mean()
```

```
data.head()
```

| date | Ozone | 90D | 360D |
|---|---|---|---|
| 2000-01-01 | 0.004032 | 0.004032 | 0.004032 |
| 2000-01-02 | 0.009486 | 0.006759 | 0.006759 |
| 2000-01-03 | 0.005580 | 0.006366 | 0.006366 |
| 2000-01-04 | 0.008717 | 0.006954 | 0.006954 |
| 2000-01-05 | 0.013754 | 0.008314 | 0.008314 |

```
# Plot data
data['2010':].plot(figsize=(10, 10))
plt.title('New York City')
plt.show()
```

Do the different rolling windows help you see long-term trends that are hard to spot in the original data?

### Rolling 360-day mean & standard deviation for NYC ozone data since 2000

The daily data are very volatile, so using a longer term rolling average can help reveal a longer term trend.

```
# Import and inspect ozone data here
data = pd.read_csv('data/air_quality_data/ozone_nyc.csv', parse_dates=['date'], index_col='date').dropna()
# Calculate the rolling mean and std here
rolling_stats = data.Ozone.rolling(360).agg(['mean', 'std'])
# Join rolling_stats with ozone data
stats = data.join(rolling_stats)
# Plot stats
stats.plot(subplots=True, figsize=(10, 10))
plt.show()
```

```
stats.head()
```

| date | Ozone | mean | std |
|---|---|---|---|
| 2000-01-01 | 0.004032 | NaN | NaN |
| 2000-01-02 | 0.009486 | NaN | NaN |
| 2000-01-03 | 0.005580 | NaN | NaN |
| 2000-01-04 | 0.008717 | NaN | NaN |
| 2000-01-05 | 0.013754 | NaN | NaN |

### Rolling quantiles for daily air quality in NYC

Next, we calculate rolling quantiles to describe changes in the dispersion of a time series over time, in a way that is less sensitive to outliers than the mean and standard deviation.

Let’s calculate rolling quantiles - at 10%, 50% (the median) and 90% - of the distribution of daily average ozone concentration in NYC using a 360-day rolling window.

```
# Import and inspect ozone data here
data = pd.read_csv('data/air_quality_data/ozone_nyc.csv', parse_dates=['date'], index_col='date')
print(data.info())
```

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6291 entries, 2000-01-01 to 2017-03-31
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Ozone   6167 non-null   float64
dtypes: float64(1)
memory usage: 98.3 KB
None
```

Since there are missing values, instead of using `dropna()` as we did above, we can use `resample('D').interpolate()` to fill them in.

```
# Resample, interpolate and inspect ozone data here
data = data.resample('D').interpolate()
print(data.info())
data.head()
```

```
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6300 entries, 2000-01-01 to 2017-03-31
Freq: D
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Ozone   6300 non-null   float64
dtypes: float64(1)
memory usage: 98.4 KB
None
```

| date | Ozone |
|---|---|
| 2000-01-01 | 0.004032 |
| 2000-01-02 | 0.009486 |
| 2000-01-03 | 0.005580 |
| 2000-01-04 | 0.008717 |
| 2000-01-05 | 0.013754 |

```
# Create the rolling window
rolling = data['Ozone'].rolling(360)
# Add the rolling quantiles as new columns
data['q10'] = rolling.quantile(0.1)
data['q50'] = rolling.quantile(0.5)
data['q90'] = rolling.quantile(0.9)
# Plot the data
data.plot(figsize=(10, 10))
plt.show()
```

The rolling quantiles help show the volatility of the series.

## Expanding Windows

Now we will move on from rolling to expanding windows, calculating metrics for **groups that get larger** to include all data up to the current date. Each data point of the resulting time series reflects all historical values up to that point.

Expanding windows are useful to calculate, for instance, a cumulative rate of return, or a running maximum or minimum. Example: **what is my cumulative return on a $1,000 investment in Google?**

### Expanding windows in pandas

In pandas, we can use either the `.expanding()` method, which works just like `.rolling()`, or - in a few cases - shorthand methods for the cumulative sum, product, min and max.

To understand the basic idea, take a look at this simple example. We start with a list of numbers from 0 to 4. You can calculate the same result using either `.expanding()` followed by `.sum()`, or by applying `.cumsum()` directly. You simply get a list where each number is the sum of all values up to and including it.

```
df = pd.DataFrame({'data': range(5)})
df['expanding_sum'] = df['data'].expanding().sum()
df['cumulative_sum'] = df['data'].cumsum()
df
```

| | data | expanding_sum | cumulative_sum |
|---|---|---|---|
| 0 | 0 | 0.0 | 0 |
| 1 | 1 | 1.0 | 1 |
| 2 | 2 | 3.0 | 3 |
| 3 | 3 | 6.0 | 6 |
| 4 | 4 | 10.0 | 10 |

### Difference between `.cumsum()` and `.diff()`

Earlier we saw that expanding windows allow us to run cumulative calculations. The cumulative sum method is in fact the inverse of the `.diff()` method.

To illustrate this, let’s use the Google stock price time series, create the differences between prices, and reconstruct the series using the cumulative sum.

```
data = pd.read_csv('data/stock_data/google.csv', index_col='Date', parse_dates=True)
data = data.dropna()
print(data.shape)
data.head()
```

```
(756, 1)
```

| Date | Close |
|---|---|
| 2014-01-02 | 556.00 |
| 2014-01-03 | 551.95 |
| 2014-01-06 | 558.10 |
| 2014-01-07 | 568.86 |
| 2014-01-08 | 570.04 |

First, let’s run the `.diff()` method. Notice that we get a `NaN` for the first row, since there is no preceding value to subtract.

```
differences = data.diff()
differences.head()
```

| Date | Close |
|---|---|
| 2014-01-02 | NaN |
| 2014-01-03 | -4.05 |
| 2014-01-06 | 6.15 |
| 2014-01-07 | 10.76 |
| 2014-01-08 | 1.18 |

```
# Drop the first NaN value
differences = data.diff().dropna()
# Select the start price (the first row; .first('D') is deprecated in newer pandas)
start_price = data.iloc[[0]]
start_price
```

| Date | Close |
|---|---|
| 2014-01-02 | 556.0 |

```
# DataFrame.append was removed in pandas 2.0, so use pd.concat instead
pd.concat([start_price, differences])
```

| Date | Close |
|---|---|
| 2014-01-02 | 556.00 |
| 2014-01-03 | -4.05 |
| 2014-01-06 | 6.15 |
| 2014-01-07 | 10.76 |
| 2014-01-08 | 1.18 |
| ... | ... |
| 2016-12-23 | -1.35 |
| 2016-12-27 | 1.64 |
| 2016-12-28 | -6.50 |
| 2016-12-29 | -2.26 |
| 2016-12-30 | -10.97 |

756 rows × 1 columns

```
# Calculate cumulative sum (pd.concat replaces the removed DataFrame.append)
cumulative_sum = pd.concat([start_price, differences]).cumsum()
# Validate that the cumulative sum reconstructs the original data
print(data.equals(cumulative_sum))
```

```
True
```

### How to calculate a running return?

The simple period return is just the current price divided by the last price, minus 1.

The return over several periods is the product of all period returns after adding 1, minus 1. Pandas makes these calculations easy - you have already seen the methods for percent change and basic math, and now you’ll learn about the cumulative product. To get the cumulative or running rate of return on the S&P 500, we follow the steps described above in pandas.
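As a quick sanity check on made-up numbers before using the real data: a 10% gain followed by a 5% loss should compound to 1.10 × 0.95 − 1 = 4.5%.

```python
import pandas as pd

# Hypothetical prices: up 10%, then down 5%
prices = pd.Series([100.0, 110.0, 104.5])

period_returns = prices.pct_change().dropna()    # 0.10, -0.05
cumulative = (period_returns + 1).cumprod() - 1  # running rate of return

# The final running return matches the direct calculation 104.5 / 100 - 1
print(cumulative.iloc[-1])  # ~0.045
```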

```
data = pd.read_csv('data/stock_data/sp500.csv', index_col='date', parse_dates=True)
data.head()
```

SP500 | |
---|---|

date | |

2007-06-29 | 1503.35 |

2007-07-02 | 1519.43 |

2007-07-03 | 1524.87 |

2007-07-05 | 1525.40 |

2007-07-06 | 1530.44 |

```
pr = data['SP500'].pct_change() # period return
pr_plus_one = pr.add(1) # or pr + 1
cumulative_return = pr_plus_one.cumprod().sub(1)
cumulative_return.mul(100).plot(figsize=(10, 10))
plt.title('SP500 running rate of return')
```

```
Text(0.5, 1.0, 'SP500 running rate of return')
```

Looks like the S&P 500 is up `60%` since 2007, despite being down `60%` in 2009.

### Running min and max of a time series

We can also easily get the running min and max of a time series: just apply the expanding method followed by the respective aggregation method. The red and green lines outline the min and max up to the current date for each day.

```
data['running_min'] = data['SP500'].expanding().min()
data['running_max'] = data['SP500'].expanding().max()
data.plot(figsize=(10, 10))
```

```
<AxesSubplot:xlabel='date'>
```

### Cumulative return on `$1,000` invested in Google vs Apple

Now let’s compare how much `$1,000` would be worth if invested in `Google ('GOOG')` or `Apple ('AAPL')` in 2010.

```
data = pd.read_csv('data/stock_data/apple_google.csv', index_col='Date', parse_dates=True)
data.head()
```

| Date | AAPL | GOOG |
|---|---|---|
| 2010-01-04 | NaN | 313.06 |
| 2010-01-05 | NaN | 311.68 |
| 2010-01-06 | NaN | 303.83 |
| 2010-01-07 | NaN | 296.75 |
| 2010-01-08 | NaN | 300.71 |

```
# Define your investment
investment = 1000
# Calculate the daily returns
returns = data.pct_change()
# Calculate the cumulative returns
returns_plus_one = returns + 1
cumulative_return = returns_plus_one.cumprod()
# Calculate and plot the value of the investment over time
cumulative_return.mul(investment).plot(figsize=(10, 10))
plt.show()
```

### Cumulative return on $1,000 invested in Google vs Apple by 1-year periods

Apple outperformed Google over the entire period, but this may have been different over various 1-year sub-periods, so switching between the two stocks might have yielded an even better result.

To analyze this, calculate the cumulative return for rolling 1-year periods, and then plot the returns to see when each stock was superior.

```
# Import numpy
import numpy as np

# Define a multi_period_return function
def multi_period_return(period_returns):
    return np.prod(period_returns + 1) - 1

# Calculate daily returns
daily_returns = data.pct_change()
# Calculate rolling annual returns over a 360-day window
rolling_annual_returns = daily_returns.rolling('360D').apply(multi_period_return)
# Plot rolling_annual_returns
rolling_annual_returns.mul(100).plot(figsize=(10, 10))
plt.show()
```

Now this tells a different story. Do you think it’s better to invest in Google or Apple?