Resampling a time series: Upsampling

8 minute read

Changing the time series frequency: resampling

In this post we will see how to change the frequency of a time series. This is a very common operation because you often need to convert two time series to a common frequency to analyze them together.

Pandas' DatetimeIndex allows us to set and change its frequency attribute, which also affects the values of the DataFrame.

Upsampling vs Downsampling:

  • Upsampling: When you upsample by converting the data to a higher frequency, you create new rows and need to tell pandas how to fill or interpolate the missing values in these rows.
  • Downsampling: When you downsample, you reduce the number of rows, and need to tell pandas how to aggregate existing data.

In this post, we will focus on upsampling techniques. In the next post, we will look into downsampling.
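As a minimal sketch of the two directions, using a small hand-made series:

```python
import pandas as pd

# Two monthly values, dated at month start
s = pd.Series([1.0, 2.0], index=pd.to_datetime(['2016-01-01', '2016-02-01']))

# Upsampling to weekly creates new rows; here we fill them by carrying
# the last known value forward
print(s.resample('W').ffill())

# Downsampling to quarterly needs an aggregation over the existing rows
print(s.resample('QS').mean())
```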

First we will explore a few basic options that pandas provides to address resampling with asfreq() and reindex(), before diving deeper into the resample() method.

Use .reindex() to convert monthly to weekly data

We can use .reindex() to conform an existing time series to a DateTimeIndex at a different frequency.

We start by creating monthly data and then convert this data to weekly frequency while applying various fill logic options.

Create monthly_dates using pd.date_range with start, end and frequency alias 'M'.

import pandas as pd

# Set start and end dates
start = '2016-1-1'
end = '2016-2-29'

# Create monthly_dates here
monthly_dates = pd.date_range(start=start, end=end, freq='M')
monthly_dates
DatetimeIndex(['2016-01-31', '2016-02-29'], dtype='datetime64[ns]', freq='M')

Create and print the pd.Series monthly, passing the list [1, 2] as the data argument, and using monthly_dates as index.

# Create and print monthly here
monthly = pd.Series([1, 2], index=monthly_dates)
monthly
2016-01-31    1
2016-02-29    2
Freq: M, dtype: int64

Now let’s apply .reindex() to monthly three times: first without additional options, then with bfill and then with ffill, print()-ing each result.

# Create weekly_dates here
weekly_dates = pd.date_range(start=start, end=end, freq='W')

# Print monthly, reindexed using weekly_dates
print(monthly.reindex(weekly_dates))
print(monthly.reindex(weekly_dates, method='bfill'))
print(monthly.reindex(weekly_dates, method='ffill'))
2016-01-03    NaN
2016-01-10    NaN
2016-01-17    NaN
2016-01-24    NaN
2016-01-31    1.0
2016-02-07    NaN
2016-02-14    NaN
2016-02-21    NaN
2016-02-28    NaN
Freq: W-SUN, dtype: float64
2016-01-03    1
2016-01-10    1
2016-01-17    1
2016-01-24    1
2016-01-31    1
2016-02-07    2
2016-02-14    2
2016-02-21    2
2016-02-28    2
Freq: W-SUN, dtype: int64
2016-01-03    NaN
2016-01-10    NaN
2016-01-17    NaN
2016-01-24    NaN
2016-01-31    1.0
2016-02-07    1.0
2016-02-14    1.0
2016-02-21    1.0
2016-02-28    1.0
Freq: W-SUN, dtype: float64
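Besides the bfill and ffill methods, .reindex() also accepts a fill_value argument that fills the new rows with a constant instead of NaN. A minimal sketch, recreating the series from above:

```python
import pandas as pd

# Recreate the monthly series and the weekly target index
monthly_dates = pd.date_range(start='2016-1-1', end='2016-2-29', freq='M')
monthly = pd.Series([1, 2], index=monthly_dates)
weekly_dates = pd.date_range(start='2016-1-1', end='2016-2-29', freq='W')

# Fill the new weekly rows with 0 instead of NaN
filled = monthly.reindex(weekly_dates, fill_value=0)
print(filled)
```

Since no position is left as NaN, the series keeps its integer dtype.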

Example: Convert monthly unemployment data to weekly data

The civilian US unemployment rate is reported monthly. We may need more frequent data, but that’s no problem because we just saw how to upsample a time series.

We’ll work with the time series data for the last 20 years, and apply a few options to fill in missing values before plotting the weekly series.

data = pd.read_csv('data/stock_data/unrate_2000.csv', index_col='date', parse_dates=True)
data
UNRATE
date
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0
... ...
2017-01-01 4.8
2017-02-01 4.7
2017-03-01 4.5
2017-04-01 4.4
2017-05-01 4.3

209 rows × 1 columns

# Show first five rows of weekly series
print(data.asfreq('W').head())

# Show first five rows of weekly series with bfill option
print(data.asfreq('W', method='bfill').head())
            UNRATE
date              
2000-01-02     NaN
2000-01-09     NaN
2000-01-16     NaN
2000-01-23     NaN
2000-01-30     NaN
            UNRATE
date              
2000-01-02     4.1
2000-01-09     4.1
2000-01-16     4.1
2000-01-23     4.1
2000-01-30     4.1

Since we used bfill, February's value of 4.1 is used to backfill the January weeks. Note that in the unfilled version even 2000-01-02 is NaN: the observations fall on the first of each month (starting 2000-01-01), while the weekly index consists of Sundays starting 2000-01-02, so none of the new dates coincide with an original observation.

Similarly, when we use the ffill option, January's value of 4.0 is used to forward-fill the January weeks.

# Create weekly series with ffill option and show first five rows
weekly_ffill = data.asfreq('W', method='ffill')
print(weekly_ffill.head())
            UNRATE
date              
2000-01-02     4.0
2000-01-09     4.0
2000-01-16     4.0
2000-01-23     4.0
2000-01-30     4.0
import matplotlib.pyplot as plt

# Plot weekly_ffill starting 2015 here
weekly_ffill.loc['2015':].plot(figsize=(10,10))
plt.show()

[Plot: forward-filled weekly unemployment rate from 2015 onward]

Remember, we only had the monthly unemployment rate in our dataset; we upsampled it with the ffill option to create a weekly series. In the figure above, we plot the rate from the start of 2015 only.

Notice how the intermediate values are identical, which creates the step-like pattern in the plot. We will see an alternative using interpolation.

Instead of using .reindex() we can also use the .resample() method.

Upsampling and interpolation using .resample() method

Next, we will dive deeper into pandas’ capabilities to convert time series frequencies.

The resample method follows a logic similar to groupby: It groups data within a resampling period, and applies a method to this group. It takes the value that results from this method, and assigns a new date within the resampling period. The new date is determined by a so-called offset, and for instance can be at the beginning or end of the period, or a custom location.
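For instance, the same two quarterly points get different date labels depending on the offset alias chosen. A minimal sketch with a hand-made series:

```python
import pandas as pd

# Two quarterly observations, dated at quarter start
quarterly = pd.Series([1.0, 2.0],
                      index=pd.to_datetime(['2016-01-01', '2016-04-01']))

# 'MS' labels each monthly period by its first calendar day ...
print(quarterly.resample('MS').first())

# ... while 'M' labels each period by its last calendar day
print(quarterly.resample('M').first())
```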

The 209 data points imported using read_csv since 2000 have no frequency information. An inspection of the first rows shows that the data are reported for the first of each calendar month. When looking at resampling by month, we have so far focused on month-end frequency. In other words, after resampling, new data points will be assigned to the last calendar day of each month.

monthly = pd.read_csv('data/stock_data/unrate_2000.csv', index_col='date', parse_dates=True)
monthly.head()
UNRATE
date
2000-01-01 4.0
2000-02-01 4.1
2000-03-01 4.0
2000-04-01 3.8
2000-05-01 4.0

There are, however, quite a few alternatives, as shown in the table below. Depending on your context, you can resample to the beginning or end of either the calendar or business month. The example dates show how business dates may deviate from the calendar month due to weekends and holidays.

| Alias | Description          | Example date |
|-------|----------------------|--------------|
| 'MS'  | calendar month start | 2017-04-01   |
| 'BMS' | business month start | 2017-04-03   |
| 'M'   | calendar month end   | 2017-04-30   |
| 'BM'  | business month end   | 2017-04-28   |

Resampling logic

Resampling implements the following logic: When up-sampling, there will be more resampling periods than data points. Each resampling period will have a given date offset, for instance month-end frequency. You then need to decide how to create data for the new resampling periods. The new data points will be assigned to the date offsets.

In contrast, when down-sampling, there are more data points than resampling periods. Hence, you need to decide how to aggregate your data to obtain a single value for each date offset.

Resampling using ffill vs interpolate

TL;DR: ffill repeats the last known value to fill the gaps, so the plot looks step-like, whereas interpolate fills in values that lie on a straight line between known points, so the plot looks smooth. Both are invoked on the Resampler object.

You can use .resample() to set a frequency for the unemployment rate. Let's use month-start frequency, given the reporting dates. When you apply the resample() method, it returns a new object, a Resampler object.

Apply another method to it, and it will again return a DataFrame. You can apply the asfreq() method to just assign the data to their offsets without modification. The .equals() method tells you that both approaches yield the same result.
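A minimal sketch of that check, using a small hand-made stand-in for the monthly unemployment DataFrame loaded above:

```python
import pandas as pd

# Small stand-in for the monthly unemployment data
monthly = pd.DataFrame(
    {'UNRATE': [4.0, 4.1, 4.0]},
    index=pd.to_datetime(['2000-01-01', '2000-02-01', '2000-03-01']))

# .resample('MS') returns a Resampler; .asfreq() assigns the data
# to the month-start offsets without modification
resampled = monthly.resample('MS').asfreq()

# Both routes yield the same result
print(monthly.asfreq('MS').equals(resampled))
```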

Let’s now use a quarterly series, real GDP growth. You see that there is again no frequency info, but the first few rows confirm that the data are reported for the first day of each quarter.

gdp = pd.read_csv('data/stock_data/gdp_growth.csv', index_col='date', parse_dates=True)
print(gdp.info())
gdp.head(2)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 41 entries, 2007-01-01 to 2017-01-01
Data columns (total 1 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   gdp_growth  41 non-null     float64
dtypes: float64(1)
memory usage: 656.0 bytes
None
gdp_growth
date
2007-01-01 0.2
2007-04-01 3.1

You can use resample to convert this series to month-start frequency, and then apply forward-fill logic to fill the gaps. We're using add_suffix to distinguish the column label from the variation that we'll produce next.

gdp_1 = gdp.resample('MS').ffill().add_suffix('_ffill')
gdp_1
gdp_growth_ffill
date
2007-01-01 0.2
2007-02-01 0.2
2007-03-01 0.2
2007-04-01 3.1
2007-05-01 3.1
... ...
2016-09-01 3.5
2016-10-01 2.1
2016-11-01 2.1
2016-12-01 2.1
2017-01-01 1.4

121 rows × 1 columns

Resample also lets you interpolate the missing values, that is, fill in values that lie on a straight line between the existing quarterly growth rates. A look at the first few rows shows how interpolate spaces the new values evenly between the existing ones.

gdp_2 = gdp.resample('MS').interpolate().add_suffix('_interpolate')
gdp_2
gdp_growth_interpolate
date
2007-01-01 0.200000
2007-02-01 1.166667
2007-03-01 2.133333
2007-04-01 3.100000
2007-05-01 2.966667
... ...
2016-09-01 2.566667
2016-10-01 2.100000
2016-11-01 1.866667
2016-12-01 1.633333
2017-01-01 1.400000

121 rows × 1 columns
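To see where a value like 1.166667 comes from, here is the same computation on a minimal stand-in series built from the first two rows above:

```python
import pandas as pd

# Two quarterly growth rates, as in the first rows of the GDP data
gdp = pd.Series([0.2, 3.1], index=pd.to_datetime(['2007-01-01', '2007-04-01']))

upsampled = gdp.resample('MS').interpolate()
print(upsampled)

# Linear interpolation adds one third of the quarterly gap per month:
# February = 0.2 + (3.1 - 0.2) / 3
print(0.2 + (3.1 - 0.2) / 3)
```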

Example showing the difference between ffill and interpolate

You have recently used the civilian US unemployment rate, and converted it from monthly to weekly frequency using simple forward or backfill methods.

Compare your previous approach to the new .interpolate() method.

# Inspect data here
print(monthly.info())

# Create weekly dates
weekly_dates = pd.date_range(monthly.index.min(), monthly.index.max(), freq='W')

# Reindex monthly to weekly data
weekly = monthly.reindex(weekly_dates)

weekly
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 209 entries, 2000-01-01 to 2017-05-01
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   UNRATE  209 non-null    float64
dtypes: float64(1)
memory usage: 3.3 KB
None
UNRATE
2000-01-02 NaN
2000-01-09 NaN
2000-01-16 NaN
2000-01-23 NaN
2000-01-30 NaN
... ...
2017-04-02 NaN
2017-04-09 NaN
2017-04-16 NaN
2017-04-23 NaN
2017-04-30 NaN

905 rows × 1 columns

# Create ffill and interpolated columns
weekly['ffill'] = weekly.UNRATE.ffill()
weekly['interpolated'] = weekly.UNRATE.interpolate()

# Plot weekly
weekly.plot(figsize=(10, 10))
plt.show()

[Plot: weekly unemployment rate, forward-filled vs. interpolated]

The plot shows how the new data points generated by interpolate lie on the line between the existing points, whereas forward-filling creates a step-like pattern.

Compare two time series with different frequencies

When we are given a quarterly time series and want to compare it with a monthly time series, we can upsample the quarterly series using resample and interpolate, as we have seen above.

For this, we will use debt-to-GDP data (quarterly) and unemployment data (monthly).

data = pd.read_csv('data/stock_data/debt_unemployment.csv', index_col='date', parse_dates=True)
print(data.info())
data
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 89 entries, 2010-01-01 to 2017-05-01
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Debt/GDP      29 non-null     float64
 1   Unemployment  89 non-null     float64
dtypes: float64(2)
memory usage: 2.1 KB
None
Debt/GDP Unemployment
date
2010-01-01 87.00386 9.8
2010-02-01 NaN 9.8
2010-03-01 NaN 9.9
2010-04-01 88.67047 9.9
2010-05-01 NaN 9.6
... ... ...
2017-01-01 104.30607 4.8
2017-02-01 NaN 4.7
2017-03-01 NaN 4.5
2017-04-01 NaN 4.4
2017-05-01 NaN 4.3

89 rows × 2 columns

Apply .interpolate() to data, assign it to interpolated_data, and inspect the result.

interpolated_data = data.interpolate()
print(interpolated_data.info())
interpolated_data
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 89 entries, 2010-01-01 to 2017-05-01
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Debt/GDP      89 non-null     float64
 1   Unemployment  89 non-null     float64
dtypes: float64(2)
memory usage: 2.1 KB
None
Debt/GDP Unemployment
date
2010-01-01 87.003860 9.8
2010-02-01 87.559397 9.8
2010-03-01 88.114933 9.9
2010-04-01 88.670470 9.9
2010-05-01 89.135103 9.6
... ... ...
2017-01-01 104.306070 4.8
2017-02-01 104.306070 4.7
2017-03-01 104.306070 4.5
2017-04-01 104.306070 4.4
2017-05-01 104.306070 4.3

89 rows × 2 columns

# Plot interpolated data here
interpolated_data.plot(secondary_y='Unemployment', figsize=(10, 10));
plt.show()

[Plot: interpolated debt/GDP and unemployment, with unemployment on the secondary y-axis]

As unemployment goes down, the debt-to-GDP ratio goes up. Remember, we started with quarterly debt/GDP data and upsampled it via interpolation so we could compare it with the monthly unemployment data.