Correlation and Regression of two time series

7 minute read

Correlation and Regression of two time series

Correlation of two time series

Oftentimes, two time series vary together. For example, Consider the stock prices of JP Morgan and the S&P500. In general, when the market drops, JP Morgan drops as well, and when the market rises, JP Morgan also rises.

We can get this data from Yahoo Finance. I have pulled the stock price data fro JPM and SP500 from 2010 to 2020.

import pandas as pd

# jpm data
jpm = pd.read_csv('data/JPM.csv', index_col='Date', parse_dates=True)
jpm = jpm[['Close']]
jpm.columns = ['JPM']

# sp500 data
sp500 = pd.read_csv('data/sp500_2020.csv', index_col='Date', parse_dates=True)
sp500 = sp500[['Close']]
sp500.columns = ['SP500']

# join both
data = jpm.join(sp500, how='inner')
2010-01-14 44.689999 1148.459961
2010-01-15 43.680000 1136.030029
2010-01-19 43.279999 1150.229980
2010-01-20 43.400002 1138.040039
2010-01-21 40.540001 1116.479980

SP500 (market) vs JPM stock in the last decade

import matplotlib.pyplot as plt

ax1 = data.SP500.plot(color='blue', grid=True, label='SP500')
ax2 = data.JPM.plot(color='red', grid=True, secondary_y=True, label='JPM')

h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()

plt.legend(h1+h2, l1+l2, loc=2)


Consider two time series that are both trending. Even if the two series are totally unrelated, you could still get a very high correlation. That’s why, when you look at the correlation of say, two stocks, you should look at the correlation of their returns, not their levels. For instance, the two series, stock prices and UFO sightings, both trend up over time. Of course, there is no relationship between those two series, but the correlation is high (0.94). If you compute the correlation of percent changes, the correlation goes down to approximately zero.

Scatter plot of returns

So, let’s first calculate the returns of these two stocks.

daily_returns = data.pct_change().mul(100)
2010-01-14 NaN NaN
2010-01-15 -2.260011 -1.082313
2010-01-19 -0.915753 1.249963
2010-01-20 0.277271 -1.059783
2010-01-21 -6.589864 -1.894490

A scatter plot of the returns of JP Morgan and the returns of the market help to visualize the relationship between the two time series.

daily_returns.plot.scatter(x='JPM', y='SP500', figsize=(10, 10))
plt.title('JPM vs SP500 daily returns scatter plot')
Text(0.5, 1.0, 'JPM vs SP500 daily returns scatter plot')


Calculate the correlation b/w JPM and SP500 returns

The correlation coefficient is a measure of how much two series vary together. A correlation of one means that the two series have a perfect linear relationship with no deviations. High correlations mean that the two series strongly vary together. A low correlation means they vary together, but there is a weak association. And a high negative correlation means they vary in opposite directions, but still with a linear relationship.


Simple Linear Regressions of Time Series

A simple linear regression finds the slope, beta, and intercept, alpha, of a line that’s the best fit between a dependent variable, y, and an independent variable, x.

The x’s and y’s can be two time series.

A word about Regression: A linear regression is also known as Ordinary Least Squares, or OLS, because it minimizes the sum of the squared distances between the data points and the regression line.

Regression techniques are very common, and therefore there are many packages in Python that can be used. In statsmodels, there is OLS. In numpy, there is polyfit, and if you set degree equals 1, it fits the data to a line, which is a linear regression. Pandas has an ols method, and scipy has a linear regression function. Beware that the order of x and y is not consistent across packages.

We will be using the statsmodel package in this post, as it is very similar to other statistical tools like R, SAS and MATLAB.

Example: Regression of Small Cap Returns on Large Cap

Now let’s assume that we’ll regress the returns of the small cap stocks on the returns of large cap stocks. As a first step, we compute returns from prices using the pct_change method in pandas. We need to add a column of ones as a dependent, right hand side variable. The reason we have to do this is because the regression function assumes that if there is no constant column, then we want to run the regression without an intercept.

By adding a column of ones, statsmodels will compute the regression coefficient of that column as well, which can be interpreted as the intercept of the line. The statsmodels method add_constant() is a simple way to add a constant.

Example using Regression of two time series

R-squared measures how closely the data fit the regression line.

So, the R-squared in a simple regression is related to the correlation between the two variables (time series).

In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.

In this example, we will start using the statistical package statsmodels, which performs much of the statistical modeling and testing that is found in R and software packages like SAS and MATLAB.

We will take two series, x(GOOG stock) and y(FB stock), compute their correlation, and then regress y on x using the function OLS(y,x) in the statsmodels.api library. Most linear regressions contain a constant term which is the intercept. To include a constant using the function OLS(), we need to add a column of 1's to the right hand size of the regression. To add a column of 1's use add_constant() as shown below.

import pandas as pd
df = pd.read_csv('data/stock_data/goog_fb.csv', index_col='date', parse_dates=True)
df.plot(figsize=(10, 10))
               FB    GOOG
2015-01-02  78.45  524.81
2015-01-05  77.19  513.87
                FB    GOOG
2016-12-29  116.35  782.79
2016-12-30  115.05  771.82



# Import the statsmodels module
import statsmodels.api as sm

# Compute correlation of x and y
x = pd.Series(df['GOOG'].values)
y = pd.Series(df['FB'].values)

correlation = x.corr(y)
print("The correlation between x and y is %4.2f" %(correlation))

# Convert the Series x to a DataFrame and name the column x
dfx = pd.DataFrame(x, columns=['x'])

# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)

# Regress y on dfx1
result = sm.OLS(y, dfx1).fit()

# Print out the results and look at the relationship between R-squared and the correlation above
The correlation between x and y is 0.93
                            OLS Regression Results                            
Dep. Variable:                      y   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.873
Method:                 Least Squares   F-statistic:                     3444.
Date:                Thu, 18 Feb 2021   Prob (F-statistic):          6.74e-227
Time:                        14:10:19   Log-Likelihood:                -1625.3
No. Observations:                 504   AIC:                             3255.
Df Residuals:                     502   BIC:                             3263.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
const        -11.0498      1.961     -5.636      0.000     -14.902      -7.198
x              0.1693      0.003     58.684      0.000       0.164       0.175
Omnibus:                       18.447   Durbin-Watson:                   0.064
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.496
Skew:                          -0.179   Prob(JB):                      0.00526
Kurtosis:                       2.390   Cond. No.                     4.91e+03

[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.91e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

So, in this case, the slope of the regression is coefficient of x 0.1693. We can also pull individual params from the result, like so:

const   -11.049824
x         0.169329
dtype: float64

Another statistic to take note of is the R-squared of 0.873. There is a relationship b/w R-squared and Correlation. Square root of R-square will give the magnitude of correlation. As shown here

import numpy as np


Generally from a scatter plot, we can see that the correlation measures how closely the data are clustered along a line. The R-squared also measures how well the linear regression line fits the data. So as you would expect, there is a relationship between correlation and R-squared.

The magnitude of the correlation is the square root of the R-squared. And the sign of the correlation is the sign of the slope of the regression line.

If the regression line is positively sloped, the correlation is positive and if the the regression line is negatively sloped, the correlation is negative. In the example you just analyzed


Pandas helps make many time series tasks quick and efficient. For instance, we routinely encounter a situation where you need to merge two time series. But before you merge

Merging Time Series With Different Dates

Stock and bond markets in the U.S. are closed on different days. For example, although the bond market is closed on Columbus Day (around Oct 12) and Veterans Day (around Nov 11), the stock market is open on those days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.

The pandas .join() method is a convenient tool to merge the stock and bond DataFrames on dates when both markets are open.

  • Take the difference of the stock set minus the bond set to get those dates where the stock market has data but the bond market does not.
  • Merge the two DataFrames into a new DataFrame, stocks_and_bonds using the .join() method, which has the syntax df1.join(df2).
    • To get the intersection of dates, use the argument how='inner'
import pandas as pd

# Convert the stock index and bond index into sets
set_stock_dates = set(stocks.index)
set_bond_dates = set(bonds.index)

# Take the difference between the sets and print
print(set_stock_dates - set_bond_dates)

# Merge stocks and bonds DataFrames using join()
stocks_and_bonds = stocks.join(bonds, how='inner')