Correlation and Regression of two time series
Correlation of two time series
Oftentimes, two time series vary together. For example, consider the stock prices of JP Morgan and the S&P 500. In general, when the market drops, JP Morgan drops as well, and when the market rises, JP Morgan also rises.
We can get this data from Yahoo Finance. I have pulled the stock price data from 2010 to 2020.
import pandas as pd

# jpm data
jpm = pd.read_csv('data/JPM.csv', index_col='Date', parse_dates=True)
jpm = jpm[['Close']]
jpm.columns = ['JPM']

# sp500 data
sp500 = pd.read_csv('data/sp500_2020.csv', index_col='Date', parse_dates=True)
sp500 = sp500[['Close']]
sp500.columns = ['SP500']

# join both
data = jpm.join(sp500, how='inner')
data.head()
SP500 (market) vs JPM stock in the last decade
import matplotlib.pyplot as plt

plt.figure(figsize=(10, 10))
ax1 = data.SP500.plot(color='blue', grid=True, label='SP500')
ax2 = data.JPM.plot(color='red', grid=True, secondary_y=True, label='JPM')
h1, l1 = ax1.get_legend_handles_labels()
h2, l2 = ax2.get_legend_handles_labels()
plt.legend(h1 + h2, l1 + l2, loc=2)
plt.show()
Common Mistake: Correlation of Two Trending Series
Consider two time series that are both trending. Even if the two series are totally unrelated, you could still get a very high correlation. That’s why, when you look at the correlation of say, two stocks, you should look at the correlation of their returns, not their levels. For instance, the two series, stock prices and UFO sightings, both trend up over time. Of course, there is no relationship between those two series, but the correlation is high (0.94). If you compute the correlation of percent changes, the correlation goes down to approximately zero.
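This effect is easy to reproduce with simulated data. The sketch below uses two independent upward-trending random walks as stand-ins for the stock-price and UFO-sighting series (all names and numbers here are made up for illustration): the levels are highly correlated, but the percent changes are not.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 1000

# Two unrelated series that both trend upward over time
stock = pd.Series(100 + 0.10 * np.arange(n) + rng.normal(0, 0.2, n).cumsum())
ufo = pd.Series(50 + 0.05 * np.arange(n) + rng.normal(0, 0.2, n).cumsum())

# Correlation of levels is spuriously high because both series trend up
level_corr = stock.corr(ufo)

# Correlation of percent changes is approximately zero
change_corr = stock.pct_change().corr(ufo.pct_change())

print(f"levels: {level_corr:.2f}, changes: {change_corr:.2f}")
```

The shared time trend dominates the level correlation even though the two series never influence each other; differencing (here, percent changes) removes the trend and exposes the lack of any real relationship.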
Scatter plot of returns
So, let’s first calculate the returns of these two stocks.
daily_returns = data.pct_change().mul(100)
daily_returns.head()
A scatter plot of the returns of JP Morgan against the returns of the market helps to visualize the relationship between the two time series.
daily_returns.plot.scatter(x='JPM', y='SP500', figsize=(10, 10))
plt.title('JPM vs SP500 daily returns scatter plot')
Calculate the correlation between JPM and SP500 returns
The correlation coefficient is a measure of how much two series vary together. A correlation of one means that the two series have a perfect linear relationship with no deviations. High correlations mean that the two series strongly vary together. A low correlation means they vary together, but there is a weak association. And a high negative correlation means they vary in opposite directions, but still with a linear relationship.
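As a concrete illustration of `.corr()` (with simulated numbers, since the exact values depend on the data you download), here is how the correlation coefficient behaves when one return series is driven by another plus independent noise:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical daily returns (%): the stock moves with the market plus noise
market = pd.Series(rng.normal(0.05, 1.0, 500), name='SP500')
stock = 1.2 * market + rng.normal(0, 0.5, 500)

# Pearson correlation coefficient, the default for .corr()
correlation = stock.corr(market)
print(f"Correlation: {correlation:.2f}")
```

Shrinking the noise term toward zero pushes the correlation toward one (a perfect linear relationship); making the noise dominate pushes it toward zero.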
Simple Linear Regressions of Time Series
A simple linear regression finds the slope, beta, and intercept, alpha, of a line that's the best fit between a dependent variable, y, and an independent variable, x. The x's and y's can be two time series.
A word about Regression: A linear regression is also known as Ordinary Least Squares, or OLS, because it minimizes the sum of the squared distances between the data points and the regression line.
Regression techniques are very common, and therefore there are many packages in Python that can be used. In statsmodels, there is OLS. In numpy, there is polyfit, and if you set deg=1, it fits the data to a line, which is a linear regression. Older versions of pandas had an ols method, and scipy has a linregress function. Beware that the order of x and y is not consistent across packages.
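To make the order-of-arguments warning concrete, here is a quick sketch on made-up data comparing numpy's polyfit and scipy's linregress, which both take x first, whereas statsmodels' OLS takes y first:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=200)

# numpy: np.polyfit(x, y, deg) -- x first; deg=1 fits a straight line
slope_np, intercept_np = np.polyfit(x, y, 1)

# scipy: stats.linregress(x, y) -- also x first, then y
res = stats.linregress(x, y)

# statsmodels, by contrast, is sm.OLS(y, X) -- the dependent variable comes FIRST
print(f"polyfit slope: {slope_np:.3f}, linregress slope: {res.slope:.3f}")
```

Both calls fit the same least-squares line, so the estimated slopes agree; the only trap is remembering which argument is which in each package.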
We will be using the statsmodels package in this post, as it is very similar to other statistical tools like R, SAS and MATLAB.
Example: Regression of Small Cap Returns on Large Cap
Now let's assume that we'll regress the returns of the small cap stocks on the returns of large cap stocks. As a first step, we compute returns from prices using the pct_change method in pandas. We need to add a column of ones as an independent, right-hand-side variable. The reason we have to do this is that the regression function assumes that if there is no constant column, then we want to run the regression without an intercept. By adding a column of ones, statsmodels will compute the regression coefficient of that column as well, which can be interpreted as the intercept of the line. The statsmodels method add_constant() is a simple way to add a constant.
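What the column of ones does can be seen with plain numpy least squares on hypothetical small-cap and large-cap returns (all numbers invented for illustration); sm.add_constant() simply prepends such a column for you:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical daily returns (%) for large caps and small caps
large_cap = rng.normal(0.05, 1.0, 250)
small_cap = 0.3 + 1.1 * large_cap + rng.normal(0, 0.5, 250)

# Without a constant column, the fitted line is forced through the origin
beta_no_const = np.linalg.lstsq(large_cap.reshape(-1, 1), small_cap, rcond=None)[0][0]

# With a column of ones (what sm.add_constant adds), we also get an intercept
X = np.column_stack([np.ones_like(large_cap), large_cap])
alpha, beta = np.linalg.lstsq(X, small_cap, rcond=None)[0]

print(f"no constant: beta={beta_no_const:.3f}")
print(f"with constant: alpha={alpha:.3f}, beta={beta:.3f}")
```

The coefficient on the ones column is exactly the intercept, alpha; leaving the column out silently pins the line to the origin.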
Example using Regression of two time series
R-squared measures how closely the data fit the regression line.
So, the R-squared in a simple regression is related to the correlation between the two variables (time series).
In particular, the magnitude of the correlation is the square root of the R-squared and the sign of the correlation is the sign of the regression coefficient.
In this example, we will use the statistical package statsmodels, which performs much of the statistical modeling and testing found in R and in software packages like SAS and MATLAB.
We will take two series, x (GOOG stock) and y (FB stock), compute their correlation, and then regress y on x using the function OLS(y, x) in the statsmodels.api library. Most linear regressions contain a constant term, which is the intercept. To include a constant using the function OLS(), we need to add a column of 1's to the right-hand side of the regression. To add a column of 1's, use the statsmodels method add_constant() as shown below.
import pandas as pd

df = pd.read_csv('data/stock_data/goog_fb.csv', index_col='date', parse_dates=True)
print(df.head(2))
print(df.tail(2))
df.plot(figsize=(10, 10))
               FB    GOOG
date
2015-01-02  78.45  524.81
2015-01-05  77.19  513.87
                FB    GOOG
date
2016-12-29  116.35  782.79
2016-12-30  115.05  771.82
# Import the statsmodels module
import statsmodels.api as sm

# Compute correlation of x and y
x = pd.Series(df['GOOG'].values)
y = pd.Series(df['FB'].values)
correlation = x.corr(y)
print("The correlation between x and y is %4.2f" % (correlation))

# Convert the Series x to a DataFrame and name the column x
dfx = pd.DataFrame(x, columns=['x'])

# Add a constant to the DataFrame dfx
dfx1 = sm.add_constant(dfx)

# Regress y on dfx1
result = sm.OLS(y, dfx1).fit()

# Print out the results and look at the relationship
# between R-squared and the correlation above
print(result.summary())
The correlation between x and y is 0.93
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.873
Method:                 Least Squares   F-statistic:                     3444.
Date:                Thu, 18 Feb 2021   Prob (F-statistic):          6.74e-227
Time:                        14:10:19   Log-Likelihood:                -1625.3
No. Observations:                 504   AIC:                             3255.
Df Residuals:                     502   BIC:                             3263.
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const        -11.0498      1.961     -5.636      0.000     -14.902      -7.198
x              0.1693      0.003     58.684      0.000       0.164       0.175
==============================================================================
Omnibus:                       18.447   Durbin-Watson:                   0.064
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               10.496
Skew:                          -0.179   Prob(JB):                      0.00526
Kurtosis:                       2.390   Cond. No.                     4.91e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.91e+03. This might indicate that there are strong multicollinearity or other numerical problems.
So, in this case, the slope of the regression is the coefficient of x, 0.1693. We can also pull individual parameters from the result, like so:

print(result.params)
const   -11.049824
x         0.169329
dtype: float64
Another statistic to take note of is the R-squared of 0.873. There is a relationship between R-squared and correlation: the square root of the R-squared gives the magnitude of the correlation, as shown here:
import numpy as np

np.sqrt(0.873)  # 0.934..., matching the correlation of 0.93 above
Generally from a scatter plot, we can see that the correlation measures how closely the data are clustered along a line. The R-squared also measures how well the linear regression line fits the data. So as you would expect, there is a relationship between correlation and R-squared.
The magnitude of the correlation is the square root of the R-squared. And the sign of the correlation is the sign of the slope of the regression line.
If the regression line is positively sloped, the correlation is positive, and if the regression line is negatively sloped, the correlation is negative. In the example you just analyzed, the slope (0.1693) is positive, and so is the correlation (0.93).
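You can check this identity numerically on simulated data: for a simple regression with an intercept, the squared correlation and the R-squared agree exactly (the variable names and coefficients here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=300)
y = 0.5 * x + rng.normal(size=300)

# Pearson correlation between x and y
corr = np.corrcoef(x, y)[0, 1]

# R-squared of the simple regression y ~ alpha + beta * x
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()

# For simple OLS with an intercept, R-squared equals the squared correlation
print(np.isclose(corr ** 2, r_squared))  # True
```

And since the slope here is positive, the correlation carries a positive sign, just as in the GOOG/FB example.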
Pandas helps make many time series tasks quick and efficient. For instance, we routinely encounter situations where we need to merge two time series. But before you merge, you need to deal with the fact that the two series may be indexed on different dates.
Merging Time Series With Different Dates
Stock and bond markets in the U.S. are closed on different days. For example, although the bond market is closed on Columbus Day (around Oct 12) and Veterans Day (around Nov 11), the stock market is open on those days. One way to see the dates that the stock market is open and the bond market is closed is to convert both indexes of dates into sets and take the difference in sets.
The .join() method is a convenient tool to merge the stock and bond DataFrames on dates when both markets are open.
- Take the difference of the stock set minus the bond set to get those dates where the stock market has data but the bond market does not.
- Merge the two DataFrames into a new DataFrame, stocks_and_bonds, using the .join() method, which has the syntax df1.join(df2).
- To get the intersection of dates, use the argument how='inner'.
import pandas as pd

# (assumes stocks and bonds DataFrames with a DatetimeIndex have already been loaded)

# Convert the stock index and bond index into sets
set_stock_dates = set(stocks.index)
set_bond_dates = set(bonds.index)

# Take the difference between the sets and print
print(set_stock_dates - set_bond_dates)

# Merge stocks and bonds DataFrames using join()
stocks_and_bonds = stocks.join(bonds, how='inner')
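To see exactly what the inner join keeps, here is a toy version with hand-made calendars (the dates and values are invented, loosely modeled on Columbus Day 2020):

```python
import pandas as pd

# Toy 'stocks' and 'bonds' DataFrames with different trading calendars
stocks = pd.DataFrame(
    {'stocks': [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(['2020-10-09', '2020-10-12', '2020-10-13', '2020-10-14']))
bonds = pd.DataFrame(
    {'bonds': [0.1, 0.2, 0.3]},
    index=pd.to_datetime(['2020-10-09', '2020-10-13', '2020-10-14']))

# Dates where the stock market traded but the bond market was closed
print(set(stocks.index) - set(bonds.index))

# Inner join keeps only the dates present in both indexes
stocks_and_bonds = stocks.join(bonds, how='inner')
print(len(stocks_and_bonds))  # 3
```

The Columbus Day row (2020-10-12) appears in the set difference and is dropped from the joined DataFrame, which is exactly the behavior you want before computing cross-market correlations.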