Exploring Companies Listed on Stock Exchanges using pandas

11 minute read

Exploring Companies Listed on Stock Exchanges using pandas

Deciding which stock to purchase can be a daunting task. To help with this task we can interrogate the data to get some answers. Suppose that we want to know:

  • What is the largest company listed on NYSE?
  • What is the largest company listed on NYSE by Sector/Industry?
  • What is the median market capitalization by Sector/IPO?
  • What is the company’s value by exchange and Sector?

Get the companies listed on Stock Exchanges: NASDAQ, NYSE and AMEX

We will use data on companies listed on the stock exchanges NASDAQ, NYSE, and AMEX with information on company name, stock symbol, last market capitalization and price, sector or industry group, and IPO year.

1. Using read_excel to load individual sheet

When you know the sheet_name in the excel, you can use the read_excel as shown below.

import pandas as pd
import matplotlib.pyplot as plt

# Import the data
nyse = pd.read_excel('data/listings.xlsx', sheet_name='nyse', na_values='n/a')

# Display the head of the data
nyse.head()
Stock Symbol Company Name Last Sale Market Capitalization IPO Year Sector Industry
0 DDD 3D Systems Corporation 14.48 1.647165e+09 NaN Technology Computer Software: Prepackaged Software
1 MMM 3M Company 188.65 1.127366e+11 NaN Health Care Medical/Dental Instruments
2 WBAI 500.com Limited 13.96 5.793129e+08 2013.0 Consumer Services Services-Misc. Amusement & Recreation
3 WUBA 58.com Inc. 36.11 5.225238e+09 2013.0 Technology Computer Software: Programming, Data Processing
4 AHC A.H. Belo Corporation 6.20 1.347351e+08 NaN Consumer Services Newspapers/Magazines

2. Using pd.ExcelFile() and read_excel() to read multiple sheets into a dictionary

Passing in a list as the sheetname argument of pd.read_excel(), whether you assign the list to a variable holding the sheet_names attribute of a pd.ExcelFile() object or type the list out yourself, constructs a dictionary. In this dictionary, the keys are the names of the sheets, and the values are the DataFrames containing the data from the corresponding sheet. You can extract values from a dictionary by providing a particular key in brackets.

# Create pd.ExcelFile() object
xls = pd.ExcelFile('data/listings.xlsx')

# Extract sheet names and store in exchanges
exchanges = xls.sheet_names

print(exchanges)

# Create listings dictionary with all sheet data
listings = pd.read_excel(xls, sheet_name=exchanges, na_values='n/a')

# Inspect NASDAQ listings
listings['nasdaq'].info()
['amex', 'nasdaq', 'nyse']
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3167 entries, 0 to 3166
Data columns (total 7 columns):
Stock Symbol             3167 non-null object
Company Name             3167 non-null object
Last Sale                3165 non-null float64
Market Capitalization    3167 non-null float64
IPO Year                 1386 non-null float64
Sector                   2767 non-null object
Industry                 2767 non-null object
dtypes: float64(3), object(4)
memory usage: 173.3+ KB

3. Combine multiple excel sheets into one

Combining listings from all three exchanges

# Create the pd.ExcelFile() object
xls = pd.ExcelFile('data/listings.xlsx')

# Extract the sheet names from xls
exchanges = xls.sheet_names

# Create an empty list: listings
listings = []

# Import the data
for exchange in exchanges:
    listing = pd.read_excel(xls, sheet_name=exchange, na_values='n/a')
    listing['Exchange'] = exchange
    listings.append(listing)

# Concatenate the listings: listing_data
listing_data = pd.concat(listings)

# Inspect the results
listing_data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 6674 entries, 0 to 3146
Data columns (total 8 columns):
Stock Symbol             6674 non-null object
Company Name             6674 non-null object
Last Sale                6590 non-null float64
Market Capitalization    6674 non-null float64
IPO Year                 2852 non-null float64
Sector                   5182 non-null object
Industry                 5182 non-null object
Exchange                 6674 non-null object
dtypes: float64(3), object(5)
memory usage: 469.3+ KB
listing_data.sample(5)
Stock Symbol Company Name Last Sale Market Capitalization IPO Year Sector Industry Exchange
262 AHL Aspen Insurance Holdings Limited 50.20 3.009244e+09 2003.0 Finance Property-Casualty Insurers nyse
671 SPTN SpartanNash Company 34.82 1.306422e+09 NaN Consumer Non-Durables Food Distributors nasdaq
2111 TRCH Torchlight Energy Resources, Inc. 1.37 7.927095e+07 NaN Energy Oil & Gas Production nasdaq
2565 SNAP Snap Inc. 20.19 2.336414e+10 2017.0 Technology Computer Software: Programming, Data Processing nyse
335 LPLA LPL Financial Holdings Inc. 37.80 3.404653e+09 2010.0 Finance Investment Bankers/Brokers/Service nasdaq

Exploratory Data Analysis of listings

We will learn how to select the stocks based on criteria such as:

  • Which stock exchange?
  • Which Sector OR Industry?
  • IPO Year
  • Market Capitilization

Or any combination of these.

1. Get ticker for the largest company listed on NYSE

Let’s start by selecting the stock ticker for the largest company by market capitalization listed on the NYSE.

Market capitalization is the combined value of all the company’s shares.

One way to find the largest company is to sort the data by market capitalization using sort_values while passing the parameter ascending equals False to sort in descending order. Select the columns Stock symbol and Company name, and display the first rows using dot-head. You will find that Johnson & Johnson, Exxon, and JP Morgan are the three largest NYSE stocks.

nyse = nyse.sort_values('Market Capitalization', ascending=False)
nyse[['Stock Symbol', 'Company Name']].head(3)
Stock Symbol Company Name
1586 JNJ Johnson & Johnson
1125 XOM Exxon Mobil Corporation
1548 JPM J P Morgan Chase & Co

Using idxmax to find the largest company listed on NYSE:

There’s another, more elegant way to select the ticker of the largest company. First, use set_index to move the stock symbol into the index of the DataFrame. Call dot-info on the new DataFrame, and you’ll now find Stock Symbols in the index, instead of integers. Now that your target is in the index, you can use the method idxmax. Select Market Capitalization as the column that you will use as criterion, and apply idxmax. The method returns the index value that corresponds to the largest market capitalization, which is, of course, the same that we got by sorting by this criterion and selecting the ticker from the first row.

# Import the data
nyse = pd.read_excel('data/listings.xlsx', sheet_name='nyse', na_values='n/a')
nyse = nyse.set_index('Stock Symbol')
nyse.head()
Company Name Last Sale Market Capitalization IPO Year Sector Industry
Stock Symbol
DDD 3D Systems Corporation 14.48 1.647165e+09 NaN Technology Computer Software: Prepackaged Software
MMM 3M Company 188.65 1.127366e+11 NaN Health Care Medical/Dental Instruments
WBAI 500.com Limited 13.96 5.793129e+08 2013.0 Consumer Services Services-Misc. Amusement & Recreation
WUBA 58.com Inc. 36.11 5.225238e+09 2013.0 Technology Computer Software: Programming, Data Processing
AHC A.H. Belo Corporation 6.20 1.347351e+08 NaN Consumer Services Newspapers/Magazines
# Index of max value
nyse['Market Capitalization'].idxmax()
'JNJ'

2. Largest company by sector listed on NYSE

import numpy as np
# unique list of sectors to iterate on
sectors = nyse['Sector'].unique()
# remove nan values if any
sectors = [item for item in sectors if item is not np.nan]

for sector in sectors:
    company = nyse.loc[nyse['Sector'] == sector, 'Market Capitalization'].idxmax()
    print("Larget company in Sector: {} is {}".format(sector, company))
Larget company in Sector: Technology is ORCL
Larget company in Sector: Health Care is JNJ
Larget company in Sector: Consumer Services is WMT
Larget company in Sector: Capital Goods is TM
Larget company in Sector: Consumer Durables is ABB
Larget company in Sector: Finance is JPM
Larget company in Sector: Miscellaneous is BABA
Larget company in Sector: Consumer Non-Durables is KO
Larget company in Sector: Public Utilities is T
Larget company in Sector: Basic Industries is PG
Larget company in Sector: Energy is XOM
Larget company in Sector: Transportation is UPS

3. Largest company by sector that had IPO in 2017 on NYSE

Example of how to use multiple filters

for sector in sectors:
    company = nyse.loc[(nyse['Sector'] == sector) & (nyse['IPO Year'] == 2017), 'Market Capitalization']
    if len(company) > 0:
        ticker = company.idxmax()
        print("Larget company in Sector: {} that had an IPO in year {} is {}".format(sector, 2017, ticker))
Larget company in Sector: Technology that had an IPO in year 2017 is SNAP
Larget company in Sector: Consumer Services that had an IPO in year 2017 is FBM
Larget company in Sector: Capital Goods that had an IPO in year 2017 is REVG
Larget company in Sector: Consumer Durables that had an IPO in year 2017 is ARD
Larget company in Sector: Finance that had an IPO in year 2017 is INVH
Larget company in Sector: Consumer Non-Durables that had an IPO in year 2017 is JILL
Larget company in Sector: Basic Industries that had an IPO in year 2017 is JELD
Larget company in Sector: Energy that had an IPO in year 2017 is JAG
Larget company in Sector: Transportation that had an IPO in year 2017 is AZUL

4. Get the largest consumer company listed after 1998

For this, we will use the combined listings of three stock exchanges that is available to us in listings_data.

# Set Stock Symbol as the index
listings = listing_data.set_index('Stock Symbol')

# Get ticker of the largest consumer services company listed after 1997
ticker = listings.loc[(listings.Sector == 'Consumer Services') &
                      (listings['IPO Year'] > 1998), 'Market Capitalization'].idxmax()

print(ticker)
NFLX

5. Top 5 companies on Nasdaq

# Import the data
nasdaq = pd.read_excel('data/listings.xlsx', sheet_name='nasdaq', na_values='n/a')
nasdaq.set_index('Stock Symbol', inplace=True)

top_5 = nasdaq['Market Capitalization'].nlargest(n=5)

# Market cap in millions of USD
top_5.div(1000000)
top_5.index.to_list()
['AAPL', 'GOOGL', 'GOOG', 'MSFT', 'AMZN']

6. Median market capitalization by sector

The market capitalization data has large outliers. To get a more robust summary of the market value of companies in each sector, you will calculate the median market capitalization by sector.

# Create market_cap_m - in million's of USD
nyse['market_cap_m'] = nyse['Market Capitalization'].div(1e6)

# Drop market cap column
nyse = nyse.drop('Market Capitalization', axis=1)

# Group nyse by sector
mcap_by_sector = nyse.groupby('Sector')

# Calculate median
median_mcap_by_sector = mcap_by_sector.market_cap_m.median()

# Plot and show as horizontal bar chart
median_mcap_by_sector.plot(kind='barh', title='NYSE - Median Market Capitalization', figsize=(20,10))

# Add the label
plt.xlabel('USD mn')

# Show the plot
plt.show()

png

It seems that companies in the Health Care sector in the NYSE have the highest median market capitalization.

7. Median market capitalization by Exchange and Sector

  • We first need to groupby Exchange and Sector
  • Then calculate the median, this will give us a Multi-index Series as we will see.
  • Unstack the multi-index series and use that to plot the bar plot.

NOTE: Whenever you are plotting a numeric column against two category columns, this is the approach you would take.

listing_data['market_cap_m'] = listing_data['Market Capitalization'].div(1e6)
listing_data = listing_data.drop('Market Capitalization', axis=1)
listing_data.head()
Stock Symbol Company Name Last Sale IPO Year Sector Industry Exchange market_cap_m
0 XXII 22nd Century Group, Inc 1.3300 NaN Consumer Non-Durables Farming/Seeds/Milling amex 120.628490
1 FAX Aberdeen Asia-Pacific Income Fund Inc 5.0000 1986.0 NaN NaN amex 1266.332595
2 IAF Aberdeen Australia Equity Fund Inc 6.1500 NaN NaN NaN amex 139.865305
3 CH Aberdeen Chile Fund, Inc. 7.2201 NaN NaN NaN amex 67.563458
4 ABE Aberdeen Emerging Markets Smaller Company Oppo... 13.3600 NaN NaN NaN amex 128.842972
# Group listings by Sector and Exchange
by_sector_exchange = listing_data.groupby(['Sector', 'Exchange'])

# Calculate the median market cap
mcap_by_sector_exchange = by_sector_exchange.market_cap_m.median()

# Display the head of the result
print(type(mcap_by_sector_exchange))
mcap_by_sector_exchange
<class 'pandas.core.series.Series'>





Sector                 Exchange
Basic Industries       amex         148.564581
                       nasdaq       150.385214
                       nyse        2450.240499
Capital Goods          amex          49.596035
                       nasdaq       293.801019
                       nyse        3215.112817
Consumer Durables      amex          18.337396
                       nasdaq       177.593858
                       nyse        2982.585002
Consumer Non-Durables  amex          60.087022
                       nasdaq       229.930660
                       nyse        4701.717926
Consumer Services      amex          44.142866
                       nasdaq       524.151914
                       nyse        1784.206237
Energy                 amex          50.963858
                       nasdaq       182.427776
                       nyse        1838.883585
Finance                amex         238.609251
                       nasdaq       205.476633
                       nyse        2177.155208
Health Care            amex          42.763595
                       nasdaq       173.823325
                       nyse        5879.060322
Miscellaneous          amex          59.131037
                       nasdaq       228.096175
                       nyse        3734.555829
Public Utilities       amex          83.283486
                       nasdaq       378.995786
                       nyse        4087.367768
Technology             amex          39.680756
                       nasdaq       600.720995
                       nyse        2077.509267
Transportation         amex         265.489675
                       nasdaq       479.777469
                       nyse         798.215734
Name: market_cap_m, dtype: float64
# Unstack mcap_by_sector_exchange
mcap_unstacked = mcap_by_sector_exchange.unstack()
mcap_unstacked
Exchange amex nasdaq nyse
Sector
Basic Industries 148.564581 150.385214 2450.240499
Capital Goods 49.596035 293.801019 3215.112817
Consumer Durables 18.337396 177.593858 2982.585002
Consumer Non-Durables 60.087022 229.930660 4701.717926
Consumer Services 44.142866 524.151914 1784.206237
Energy 50.963858 182.427776 1838.883585
Finance 238.609251 205.476633 2177.155208
Health Care 42.763595 173.823325 5879.060322
Miscellaneous 59.131037 228.096175 3734.555829
Public Utilities 83.283486 378.995786 4087.367768
Technology 39.680756 600.720995 2077.509267
Transportation 265.489675 479.777469 798.215734
# Plot as a bar chart
mcap_unstacked.plot(kind='bar', title='Median Market Capitalization by Exchange and Sector', figsize=(20, 10))

# Set the x label
plt.ylabel('USD mn')
plt.xlabel('Sector')

# Show the plot
plt.show()

png

As you can see, the NYSE has the highest median market capitalization for each sector.

8. Number of companies by sector on Nasdaq

Using Seaborn Countplot with one category by sorting the values by count.

import seaborn as sns

nasdaq = pd.read_excel('data/listings.xlsx', sheet_name='nasdaq', na_values='n/a')
sector_size = nasdaq.groupby('Sector').size()
order = sector_size.sort_values(ascending=False)
order = order.index.tolist()

fig, ax = plt.subplots(figsize=(20, 10))

sns.countplot(x='Sector', data=nasdaq, order=order, ax=ax)

plt.title("Number of Observations per Sector")
plt.xticks(rotation=45)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11]),
 <a list of 12 Text xticklabel objects>)

png

9. Number of companies that went IPO per Sector since 2014

Countplot, multiple categories. Observe that you don’t need to groupby when using countplot as it inherently counts the unique observations per category.

Seaborn will produce a grouped bar chart that shows the number of observations for each sector and IPO year.

recent_ipos = nasdaq[nasdaq['IPO Year'] > 2014]
# to avoid SettingWithCopyWarning
recent_ipos = recent_ipos.copy()
recent_ipos['IPO Year'] = recent_ipos['IPO Year'].astype(int)

fig, ax = plt.subplots(figsize=(20, 10))
sns.countplot(x='Sector', hue='IPO Year', data=recent_ipos, ax=ax)

plt.title("Number of companies that went IPO per Sector since 2014")
plt.xticks(rotation=45)
(array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10]),
 <a list of 11 Text xticklabel objects>)

png

10. Compare market capitalization of companies that went IPO before 2000 vs after 2000

The pointplot is a useful tool to compare statistics for more than one categorical variable. We first create a new variable called IPO that indicates whether the IPO occurred before or after the year 2000. We then add this variable to the pointplot to differentiate the average market cap by sector by this second category. As expected, for most sectors, market capitalization is higher for older companies, but this is not true for all cases.

nasdaq = pd.read_excel('data/listings.xlsx', sheet_name='nasdaq', na_values='n/a')
nasdaq['market_cap_m'] = nasdaq['Market Capitalization'].div(1e6)
nasdaq = nasdaq.drop('Market Capitalization', axis=1)

nasdaq['IPO'] = nasdaq['IPO Year'].apply(lambda x: 'After 2000' if x > 2000 else 'Before 2000')

fig, ax = plt.subplots(figsize=(20, 10))
sns.pointplot(x='Sector', y='market_cap_m', hue='IPO', data=nasdaq, ax=ax)
plt.xticks(rotation=45)
plt.title('Market Cap (USDm)')
Text(0.5, 1.0, 'Market Cap (USDm)')

png

Tags:

Updated: