Basics of manipulating time series


This post lays the foundations for the powerful time series functionality that pandas provides through its date handling, in particular the DatetimeIndex. We will first look at the basic time series methods and transformations, and then work through several examples that show what the DatetimeIndex makes possible.

Basic methods: parsing dates, slicing, and changing frequency

Basic methods include:

  • parsing dates provided as strings
  • converting the result into the matching pandas data type, datetime64
  • selecting subperiods of your time series
  • setting or changing the frequency of the DatetimeIndex

You can change the frequency to a higher or lower value: upsampling means increasing the time frequency, which requires generating new rows (and deciding how to fill them), while downsampling means decreasing the time frequency, which requires aggregating the existing data.
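As a minimal, self-contained sketch of both directions (the toy series and its values below are made up purely for illustration):

import pandas as pd

# A small monthly series to illustrate up- and downsampling (values are made up)
ts = pd.Series([1.0, 2.0, 3.0],
               index=pd.date_range('2020-01-31', periods=3, freq='M'))

# Upsampling to calendar-day frequency creates new dates filled with NaN
daily = ts.asfreq('D')

# Downsampling back to monthly requires aggregating, e.g. a mean per month
monthly_mean = daily.resample('M').mean()

The upsampled rows can then be filled, for example with .ffill() or interpolation, depending on what makes sense for the data.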

Example: Air Quality dataset

Our first data set contains air quality measurements for NYC (ozone, pm25, and carbon monoxide) from mid-1999 through early 2017. You will often have to deal with dates stored as strings, i.e. with data type object. Here, the column called 'date' has data type object, but when you print the first few rows with .head() you can see that it contains dates. To convert the strings to the correct data type, use pd.to_datetime().

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv('data/air_quality_data/nyc.csv')

# Inspect data
print(data.info())

# Convert the date column to datetime64
data['date'] = pd.to_datetime(data['date'])

# Set date column as index
data.set_index('date', inplace=True)

# Inspect data
print(data.info())

# Plot data
data.plot(subplots=True, figsize=(10,10))
plt.show()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6317 entries, 0 to 6316
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   date    6317 non-null   object
 1   ozone   6317 non-null   float64
 2   pm25    6317 non-null   float64
 3   co      6317 non-null   float64
dtypes: float64(3), object(1)
memory usage: 197.5+ KB
None
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 6317 entries, 1999-07-01 to 2017-03-31
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   ozone   6317 non-null   float64
 1   pm25    6317 non-null   float64
 2   co      6317 non-null   float64
dtypes: float64(3)
memory usage: 197.4 KB
None

[Figure: ozone, pm25, and co plotted as separate time series subplots]

Partial string indexing example: Compare stock price over three years

For this we will use the Yahoo stock price data. We will compare how the same stock price behaved in each of three years and plot the years against each other.

yahoo = pd.read_csv('data/stock_data/yahoo.csv')
yahoo['date'] = pd.to_datetime(yahoo['date'])
yahoo.set_index('date', inplace=True)
yahoo.head()
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66

You can use strings that represent a complete date, or just the relevant parts of a date. For example, if you pass a string representing only a year, pandas returns all dates within that year.

If you pass a slice that starts with one month and ends at another, you get all dates within that range. Note that, unlike most slicing in Python, the date range is inclusive of the end date.

You can also use .loc with a complete date and a column label to select a specific stock price.
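A minimal sketch of these selection patterns on the yahoo DataFrame loaded above (the specific dates are chosen only for illustration):

# All rows from 2014, using just the year as a partial string
prices_2014 = yahoo.loc['2014']

# All rows from February through May 2013; the end month is included
spring_2013 = yahoo.loc['2013-02':'2013-05']

# A single price, selected with a complete date and a column label
price_jan_7 = yahoo.loc['2013-01-07', 'price']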

# Create dataframe prices here
prices = pd.DataFrame()

# Select data for each year and concatenate with prices here
for year in ['2013', '2014', '2015']:
    price_per_year = yahoo.loc[year, ['price']].reset_index(drop=True)
    price_per_year.rename(columns={'price': year}, inplace=True)
    prices = pd.concat([prices, price_per_year], axis=1)

# Plot prices
prices.plot(figsize=(10, 10))
plt.show()

[Figure: Yahoo stock price for 2013, 2014, and 2015 plotted in one chart]

Set and change time series frequency using asfreq()

You may have noticed that our DatetimeIndex did not carry frequency information. You can set the frequency with .asfreq(). The alias 'D' stands for calendar day frequency. As a result, the DatetimeIndex now contains many dates on which the stock wasn't traded.

These new dates have missing values. This is also called upsampling, because the new DataFrame has a higher frequency than the original version.
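As a minimal sketch of this effect on the yahoo DataFrame from the previous example (variable names follow the code above):

# Upsample the stock prices to calendar-day frequency; weekends and
# holidays appear as new rows whose 'price' is NaN
yahoo_daily = yahoo.asfreq('D')
print(yahoo_daily.shape[0], yahoo_daily['price'].isna().sum())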

Now we will see how to assign a frequency to a DatetimeIndex, and then change this frequency. For this we use data on the daily carbon monoxide concentration in NYC, LA and Chicago from 2005 to 2010.

We’ll set the frequency to calendar daily, then change it to month-end frequency, and visualize both series to see how the choice of frequency affects the data.

co = pd.read_csv('data/air_quality_data/co_cities.csv', index_col='date', parse_dates=True)
co.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1898 entries, 2005-01-01 to 2010-12-31
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Chicago      1898 non-null   float64
 1   Los Angeles  1898 non-null   float64
 2   New York     1898 non-null   float64
dtypes: float64(3)
memory usage: 59.3 KB
co.head()
Chicago Los Angeles New York
date
2005-01-01 0.317763 0.777657 0.639830
2005-01-03 0.520833 0.349547 0.969572
2005-01-04 0.477083 0.626630 0.905208
2005-01-05 0.348822 0.613814 0.769176
2005-01-06 0.572917 0.792596 0.815761
# Set the frequency to calendar daily
co = co.asfreq('D')

# Inspect data
print(co.info())

# Plot the data
co.plot(subplots=True, figsize=(10, 10))
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 2191 entries, 2005-01-01 to 2010-12-31
Freq: D
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Chicago      1898 non-null   float64
 1   Los Angeles  1898 non-null   float64
 2   New York     1898 non-null   float64
dtypes: float64(3)
memory usage: 68.5 KB
None






[Figure: daily CO concentration in Chicago, Los Angeles, and New York]

# Set frequency to monthly
co = co.asfreq('M')

# Inspect data
print(co.info())

# Plot the data
co.plot(subplots=True, figsize=(10, 10))
plt.show()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 72 entries, 2005-01-31 to 2010-12-31
Freq: M
Data columns (total 3 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   Chicago      68 non-null     float64
 1   Los Angeles  68 non-null     float64
 2   New York     68 non-null     float64
dtypes: float64(3)
memory usage: 2.2 KB
None






[Figure: month-end CO concentration in Chicago, Los Angeles, and New York]

Manipulating time series: lags, shifts, changes, and percent change

Now we will begin to manipulate time series data. In particular, we will see:

  1. how to move data across time so that we can compare values at different points in time; this involves shifting values into the future, or creating lags by moving data into the past,
  2. how to calculate changes between values at different points in time,
  3. and how to calculate the change between values in percentage terms, also called the rate of growth.

Pandas has built-in methods for these calculations that leverage the DatetimeIndex.

Shifting and Lagging to compare

The first method for manipulating time series is .shift(), which allows you to shift all values in a Series or DataFrame by a number of periods to a different point in time along the DatetimeIndex.

# Import data here
google = pd.read_csv('data/stock_data/google.csv', parse_dates=['Date'], index_col='Date')

# Set data frequency to business daily
google = google.asfreq('B')

# Create 'lagged' and 'shifted'
google['lagged'] = google.Close.shift(periods=-90)
google['shifted'] = google.Close.shift(periods=90)

# Plot the google price series
google.plot(figsize=(10,10))
plt.show()

[Figure: Google close price with its 90-day lagged and shifted versions]

google.head()
Close lagged shifted
Date
2014-01-02 556.00 511.00 NaN
2014-01-03 551.95 518.73 NaN
2014-01-06 558.10 529.92 NaN
2014-01-07 568.86 533.09 NaN
2014-01-08 570.04 526.65 NaN
google.tail()
Close lagged shifted
Date
2016-12-26 NaN NaN 772.15
2016-12-27 791.55 NaN 772.08
2016-12-28 785.05 NaN 769.64
2016-12-29 782.79 NaN 769.41
2016-12-30 771.82 NaN 769.54

Fantastic! Now you can visually compare the time series to itself at different points in time.

The ‘shifted’ version of the stock price has all prices moved by 90 periods into the future. As a result, the first 90 values in the series are now missing.

In contrast, the lagged version of the stock price is moved 90 periods into the past. In this case, the last 90 values are now missing. To shift data into the past, use negative period numbers.

Shifting data is useful to compare data at different points in time.

One-period percent change

Calculate the one-period percent change: you can, for instance, calculate the rate of change from one period to the next, which in finance is called the return.

The method .div() allows you to divide a Series not only by a single value, but also by another entire Series, for instance another column in the same DataFrame. Pandas makes sure the dates of both series match up and divides the aligned values accordingly. As a result, you get the relative change from the last period for every price, that is, the factor by which you need to multiply the last price to get the current price.
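A minimal sketch of this on the google DataFrame loaded above:

# Divide each close price by the previous day's close; the result is the
# factor by which the previous price must be multiplied to get the current one
relative_change = google.Close.div(google.Close.shift(periods=1))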

Relative change in percent terms

Calculate the relative change in percent terms: as you have seen before, you can chain any DataFrame methods that return a DataFrame, with the result of one call used as input for the next. Here, we subtract 1 and multiply the result by 100 to obtain the relative change in percentage terms.
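Continuing the sketch above, the chained version looks like this:

# Chain .div(), .sub() and .mul() to express the change in percent
pct_manual = google.Close.div(google.Close.shift(periods=1)).sub(1).mul(100)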

Using pandas diff() method

The .diff() method: another time series method is .diff(), which calculates the change between values at different points in time. By default, .diff() takes the difference between two adjacent periods.
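A minimal sketch of the default behaviour on the same column:

# Absolute change between each close price and the previous one
absolute_change = google.Close.diff()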

Using pandas pct_change() method

pct_change: finally, since it is such a common operation, pandas has a built-in method to calculate the percent change directly. Just select a column and call .pct_change(); multiply by 100 to get the same result as the chained calculation above.
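A minimal sketch, again on the google Close prices:

# One-period percent change, scaled to percentage points
daily_pct = google.Close.pct_change().mul(100)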

All these methods have a 'periods' keyword, which you have already seen for .shift() and which defaults to 1. If you provide a higher value, you can calculate returns for data points several periods apart, for example for prices three trading days apart.
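And a short sketch of the periods keyword, here for prices three trading days apart:

# Percent change between prices three periods apart
three_day_pct = google.Close.pct_change(periods=3).mul(100)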

Example illustrating .diff()

Let’s load the Yahoo stock prices for the years 2013 to 2015, set the frequency to business daily, and assign the result to yahoo.

yahoo = pd.read_csv('data/stock_data/yahoo.csv')
yahoo['date'] = pd.to_datetime(yahoo['date'])
yahoo.set_index('date', inplace=True)

# Set the frequency to business daily
yahoo = yahoo.asfreq('B')
yahoo
price
date
2013-01-02 20.08
2013-01-03 19.78
2013-01-04 19.86
2013-01-07 19.40
2013-01-08 19.66
... ...
2015-12-25 NaN
2015-12-28 33.60
2015-12-29 34.04
2015-12-30 33.37
2015-12-31 33.26

782 rows × 1 columns

# Create a new column called shifted_30 that contains the 'price'
# shifted by 30 business days into the future.

# Create shifted_30 here
yahoo['shifted_30'] = yahoo.price.shift(periods=30)


# Subtract 'shifted_30' from 'price', and assign the result to a new column, 'change_30'.
# Subtract shifted_30 from price
yahoo['change_30'] = yahoo.price - yahoo.shifted_30

# Apply .diff(), setting periods to 30, and assign the result to a new column, 'diff_30'.
# Get the 30-day price difference
yahoo['diff_30'] = yahoo.price.diff(periods=30)

# Inspect the last five rows of price
yahoo.tail()
price shifted_30 change_30 diff_30
date
2015-12-25 NaN 32.19 NaN NaN
2015-12-28 33.60 32.94 0.66 0.66
2015-12-29 34.04 32.86 1.18 1.18
2015-12-30 33.37 32.98 0.39 0.39
2015-12-31 33.26 32.62 0.64 0.64
# Show the value_counts of the difference between change_30 and diff_30
yahoo['change_30'].sub(yahoo['diff_30']).value_counts()
0.0    703
dtype: int64

There’s usually more than one way to get to the same result when working with data.

Finding patterns in time series using .pct_change()

We can use the .pct_change() method to calculate returns over various calendar-day periods and plot the results to compare the different patterns. We will use the ‘GOOG’ stock prices for the years 2014-2016, set the frequency to calendar daily, and assign the result to google.

# Import data here
google = pd.read_csv('data/stock_data/google.csv', parse_dates=['Date'], index_col='Date')

# Set data frequency to calendar daily
google = google.asfreq('D')

Let’s create the columns 'daily_return', 'monthly_return', and 'annual_return' that contain the pct_change() of ‘Close’ for 1, 30 and 360 calendar days, respectively, and multiply each by 100.

# Create daily_return
google['daily_return'] = google.Close.pct_change(periods=1).multiply(100)

# Create monthly_return
google['monthly_return'] = google.Close.pct_change(periods=30).multiply(100)

# Create annual_return
google['annual_return'] = google.Close.pct_change(periods=360).multiply(100)

# Plot the result
google.plot(subplots=True, figsize=(10,10))
plt.show()

[Figure: Google close price with daily, monthly, and annual returns]

Now we can see how the stock price returns behaved over the three horizons: daily, monthly, and annual.