Manipulation DataFrames using Pandas

10 minute read

Grouping and Aggregating dataframes

The Olympic medal data for the following exercises comes from https://assets.datacamp.com/production/course_1650/datasets/all_medalists.csv . It comprises records of all events held at the Olympic games between 1896 and 2012. Suppose you have loaded the data into a DataFrame medals. You now want to find the total number of medals awarded to the USA per edition.

import pandas as pd
medals = pd.read_csv('https://assets.datacamp.com/production/course_1650/datasets/all_medalists.csv')
medals.head()
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
1 Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
2 Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
3 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold
4 Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver
medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City            29216 non-null object
Edition         29216 non-null int64
Sport           29216 non-null object
Discipline      29216 non-null object
Athlete         29216 non-null object
NOC             29216 non-null object
Gender          29216 non-null object
Event           29216 non-null object
Event_gender    29216 non-null object
Medal           29216 non-null object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB

The columns NOC represents the country.

medals.loc[medals.NOC == "USA"].groupby('Edition')['Medal'].count()
Edition
1896     20
1900     55
1904    394
1908     63
1912    101
1920    193
1924    198
1928     84
1932    181
1936     92
1948    148
1952    130
1956    118
1960    112
1964    150
1968    149
1972    155
1976    155
1984    333
1988    193
1992    224
1996    260
2000    248
2004    264
2008    315
Name: Medal, dtype: int64

What are the top 15 countries ranked by total number of medals ?

For this, we can use the pandas Series method .value_counts() to determine the top 15 countries ranked by total number of medals.

# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']

# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()

# Print top 15 countries ranked by medals
print(medal_counts.head(15))
USA    4335
URS    2049
GBR    1594
FRA    1314
ITA    1228
GER    1211
AUS    1075
HUN    1053
SWE    1021
GDR     825
NED     782
JPN     704
CHN     679
RUS     638
ROU     624
Name: NOC, dtype: int64

Using .pivot_table() to count medals by type

Let’s see how using .pivot_table() helps in summarizing the data. Take a look at the head().

medals.head()
City Edition Sport Discipline Athlete NOC Gender Event Event_gender Medal
0 Athens 1896 Aquatics Swimming HAJOS, Alfred HUN Men 100m freestyle M Gold
1 Athens 1896 Aquatics Swimming HERSCHMANN, Otto AUT Men 100m freestyle M Silver
2 Athens 1896 Aquatics Swimming DRIVAS, Dimitrios GRE Men 100m freestyle for sailors M Bronze
3 Athens 1896 Aquatics Swimming MALOKINIS, Ioannis GRE Men 100m freestyle for sailors M Gold
4 Athens 1896 Aquatics Swimming CHASAPIS, Spiridon GRE Men 100m freestyle for sailors M Silver

pivot_table is used when you want to see how a particular column relates to another column. For example, here we want to see how each country (NOC) column relates to the Medal column. Another example is if you want to know how the medals are distributed amoung men and women for each country, you can do this:

medals.pivot_table(index='NOC', columns='Gender', values = 'Athlete', aggfunc='count')
Gender Men Women
NOC
AFG 1.0 NaN
AHO 1.0 NaN
ALG 11.0 3.0
ANZ 27.0 2.0
ARG 183.0 56.0
ARM 9.0 NaN
AUS 647.0 428.0
AUT 125.0 21.0
AZE 12.0 4.0
BAH 10.0 13.0
BAR 1.0 NaN
BDI 1.0 NaN
BEL 390.0 18.0
BER 1.0 NaN
BLR 45.0 47.0
BOH 6.0 1.0
BRA 262.0 110.0
BUL 214.0 117.0
BWI 5.0 NaN
CAN 411.0 181.0
CHI 32.0 1.0
CHN 218.0 461.0
CIV 1.0 NaN
CMR 20.0 2.0
COL 6.0 5.0
CRC NaN 4.0
CRO 75.0 4.0
CUB 299.0 96.0
CZE 29.0 12.0
DEN 402.0 89.0
... ... ...
SRI 1.0 1.0
SUD 1.0 NaN
SUI 350.0 26.0
SUR 2.0 NaN
SVK 24.0 5.0
SWE 932.0 89.0
SYR 2.0 1.0
TAN 2.0 NaN
TCH 249.0 80.0
TGA 1.0 NaN
THA 13.0 8.0
TJK 2.0 NaN
TOG 1.0 NaN
TPE 30.0 12.0
TRI 20.0 NaN
TUN 7.0 NaN
TUR 75.0 7.0
UAE 1.0 NaN
UGA 6.0 NaN
UKR 75.0 73.0
URS 1476.0 573.0
URU 76.0 NaN
USA 3120.0 1215.0
UZB 16.0 1.0
VEN 9.0 2.0
VIE 1.0 1.0
YUG 373.0 62.0
ZAM 2.0 NaN
ZIM NaN 23.0
ZZX 45.0 3.0

138 rows × 2 columns

Let’s say you are now interested in finding out which countries have women won as many or more medals than men To answer this question, we need to take the above output and then add another boolean Series/column which is True for a country where women won more medals than men. Then, use that column to filter the dataframe to find the countries where women won more medals.

genderCounts = medals.pivot_table(index='NOC', columns='Gender', values = 'Athlete', aggfunc='count')
genderCounts.loc[genderCounts['Women'] >= genderCounts['Men']]
Gender Men Women
NOC
BAH 10.0 13.0
BLR 45.0 47.0
CHN 218.0 461.0
IOP 1.0 2.0
JAM 47.0 55.0
PER 3.0 12.0
ROU 299.0 325.0
SIN 1.0 3.0
SRI 1.0 1.0
VIE 1.0 1.0

While this is true for the countries listed above, there are some countries which have NAN values, example ZIM. These are lost during comparison. Hence, we must first replace NAN values from our genderCounts dataframe with 0’s.

genderCounts.fillna(value=0, inplace=True)

Now, if we run our comparison, we can get the countries where women won as many or more medals than men.

genderCounts.loc[genderCounts['Women'] >= genderCounts['Men']]
Gender Men Women
NOC
BAH 10.0 13.0
BLR 45.0 47.0
CHN 218.0 461.0
CRC 0.0 4.0
IOP 1.0 2.0
JAM 47.0 55.0
MOZ 0.0 2.0
PER 3.0 12.0
ROU 299.0 325.0
SIN 1.0 3.0
SRI 1.0 1.0
VIE 1.0 1.0
ZIM 0.0 23.0

Now, we see that there are 3 more countries added to the list, namely, ZIM, MOZ and CRC.

Next, I am interested in knowing in which countries women won more medals than men ?

medals.pivot_table(index='NOC', columns=['Gender', 'Medal'], values = 'Athlete', aggfunc='count')
Gender Men Women
Medal Bronze Gold Silver Bronze Gold Silver
NOC
AFG 1.0 NaN NaN NaN NaN NaN
AHO NaN NaN 1.0 NaN NaN NaN
ALG 7.0 2.0 2.0 1.0 2.0 NaN
ANZ 5.0 19.0 3.0 NaN 1.0 1.0
ARG 51.0 68.0 64.0 37.0 NaN 19.0
ARM 7.0 1.0 1.0 NaN NaN NaN
AUS 263.0 148.0 236.0 150.0 145.0 133.0
AUT 31.0 17.0 77.0 13.0 4.0 4.0
AZE 6.0 3.0 3.0 3.0 1.0 NaN
BAH 4.0 2.0 4.0 1.0 7.0 5.0
BAR 1.0 NaN NaN NaN NaN NaN
BDI NaN 1.0 NaN NaN NaN NaN
BEL 141.0 88.0 161.0 9.0 3.0 6.0
BER 1.0 NaN NaN NaN NaN NaN
BLR 24.0 8.0 13.0 29.0 6.0 12.0
BOH 5.0 NaN 1.0 1.0 NaN NaN
BRA 132.0 44.0 86.0 42.0 15.0 53.0
BUL 80.0 40.0 94.0 56.0 13.0 48.0
BWI 5.0 NaN NaN NaN NaN NaN
CAN 144.0 122.0 145.0 83.0 32.0 66.0
CHI 21.0 3.0 8.0 NaN NaN 1.0
CHN 48.0 95.0 75.0 145.0 139.0 177.0
CIV NaN NaN 1.0 NaN NaN NaN
CMR 1.0 18.0 1.0 NaN 2.0 NaN
COL 3.0 NaN 3.0 4.0 1.0 NaN
CRC NaN NaN NaN 2.0 1.0 1.0
CRO 15.0 31.0 29.0 3.0 NaN 1.0
CUB 76.0 116.0 107.0 33.0 44.0 19.0
CZE 10.0 6.0 13.0 3.0 4.0 5.0
DEN 130.0 95.0 177.0 22.0 52.0 15.0
... ... ... ... ... ... ...
SRI NaN NaN 1.0 NaN NaN 1.0
SUD NaN NaN 1.0 NaN NaN NaN
SUI 130.0 71.0 149.0 8.0 2.0 16.0
SUR 1.0 1.0 NaN NaN NaN NaN
SVK 8.0 8.0 8.0 NaN 2.0 3.0
SWE 284.0 328.0 320.0 41.0 19.0 29.0
SYR 1.0 NaN 1.0 NaN 1.0 NaN
TAN NaN NaN 2.0 NaN NaN NaN
TCH 94.0 60.0 95.0 11.0 20.0 49.0
TGA NaN NaN 1.0 NaN NaN NaN
THA 6.0 4.0 3.0 4.0 3.0 1.0
TJK 1.0 NaN 1.0 NaN NaN NaN
TOG 1.0 NaN NaN NaN NaN NaN
TPE 4.0 1.0 25.0 9.0 1.0 2.0
TRI 11.0 1.0 8.0 NaN NaN NaN
TUN 3.0 2.0 2.0 NaN NaN NaN
TUR 20.0 36.0 19.0 2.0 1.0 4.0
UAE NaN 1.0 NaN NaN NaN NaN
UGA 2.0 1.0 3.0 NaN NaN NaN
UKR 38.0 15.0 22.0 40.0 17.0 16.0
URS 419.0 575.0 482.0 165.0 263.0 145.0
URU 30.0 44.0 2.0 NaN NaN NaN
USA 766.0 1520.0 834.0 286.0 568.0 361.0
UZB 7.0 4.0 5.0 1.0 NaN NaN
VEN 6.0 1.0 2.0 2.0 NaN NaN
VIE NaN NaN 1.0 NaN NaN 1.0
YUG 102.0 125.0 146.0 16.0 18.0 28.0
ZAM 1.0 NaN 1.0 NaN NaN NaN
ZIM NaN NaN NaN 1.0 18.0 4.0
ZZX 8.0 23.0 14.0 2.0 NaN 1.0

138 rows × 6 columns

genderGold = medals.pivot_table(index='NOC', columns=['Gender', 'Medal'], values = 'Athlete', aggfunc='count')

Here, the goal is the access the Multi-Index column and select only Gold column of the second level. This can be done as follows:

genderGold.loc[:, (['Men', 'Women'], 'Gold')]

# or using slice(None) to specify all columns.
#genderGold.loc[:, (slice(None), 'Gold')]
Gender Men Women
Medal Gold Gold
NOC
AFG NaN NaN
AHO NaN NaN
ALG 2.0 2.0
ANZ 19.0 1.0
ARG 68.0 NaN
ARM 1.0 NaN
AUS 148.0 145.0
AUT 17.0 4.0
AZE 3.0 1.0
BAH 2.0 7.0
BAR NaN NaN
BDI 1.0 NaN
BEL 88.0 3.0
BER NaN NaN
BLR 8.0 6.0
BOH NaN NaN
BRA 44.0 15.0
BUL 40.0 13.0
BWI NaN NaN
CAN 122.0 32.0
CHI 3.0 NaN
CHN 95.0 139.0
CIV NaN NaN
CMR 18.0 2.0
COL NaN 1.0
CRC NaN 1.0
CRO 31.0 NaN
CUB 116.0 44.0
CZE 6.0 4.0
DEN 95.0 52.0
... ... ...
SRI NaN NaN
SUD NaN NaN
SUI 71.0 2.0
SUR 1.0 NaN
SVK 8.0 2.0
SWE 328.0 19.0
SYR NaN 1.0
TAN NaN NaN
TCH 60.0 20.0
TGA NaN NaN
THA 4.0 3.0
TJK NaN NaN
TOG NaN NaN
TPE 1.0 1.0
TRI 1.0 NaN
TUN 2.0 NaN
TUR 36.0 1.0
UAE 1.0 NaN
UGA 1.0 NaN
UKR 15.0 17.0
URS 575.0 263.0
URU 44.0 NaN
USA 1520.0 568.0
UZB 4.0 NaN
VEN 1.0 NaN
VIE NaN NaN
YUG 125.0 18.0
ZAM NaN NaN
ZIM NaN 18.0
ZZX 23.0 NaN

138 rows × 2 columns

# store this into a dataframe
genderGold = genderGold.loc[:, (['Men', 'Women'], 'Gold')]

# Replace NaNs with 0
genderGold.fillna(value=0, inplace=True)

genderGold.head()
Gender Men Women
Medal Gold Gold
NOC
AFG 0.0 0.0
AHO 0.0 0.0
ALG 2.0 2.0
ANZ 19.0 1.0
ARG 68.0 0.0
# Filter rows which have the desired criteria
# For the sake of simplicity, create a boolean Series
womenGold = genderGold.loc[:,(['Women'], 'Gold')]
womenGold.info()
<class 'pandas.core.frame.DataFrame'>
Index: 138 entries, AFG to ZZX
Data columns (total 1 columns):
(Women, Gold)    138 non-null float64
dtypes: float64(1)
memory usage: 2.2+ KB
# Filter rows which have the desired criteria
# For the sake of simplicity, create a boolean Series
# genderGold.loc[:, (['Women'], 'Gold')] >= genderGold.loc[:, (['Men'], 'Gold')]
medals.pivot_table(index='NOC', columns='Medal', values= 'Athlete', aggfunc='count')
Medal Bronze Gold Silver
NOC
AFG 1.0 NaN NaN
AHO NaN NaN 1.0
ALG 8.0 4.0 2.0
ANZ 5.0 20.0 4.0
ARG 88.0 68.0 83.0
ARM 7.0 1.0 1.0
AUS 413.0 293.0 369.0
AUT 44.0 21.0 81.0
AZE 9.0 4.0 3.0
BAH 5.0 9.0 9.0
BAR 1.0 NaN NaN
BDI NaN 1.0 NaN
BEL 150.0 91.0 167.0
BER 1.0 NaN NaN
BLR 53.0 14.0 25.0
BOH 6.0 NaN 1.0
BRA 174.0 59.0 139.0
BUL 136.0 53.0 142.0
BWI 5.0 NaN NaN
CAN 227.0 154.0 211.0
CHI 21.0 3.0 9.0
CHN 193.0 234.0 252.0
CIV NaN NaN 1.0
CMR 1.0 20.0 1.0
COL 7.0 1.0 3.0
CRC 2.0 1.0 1.0
CRO 18.0 31.0 30.0
CUB 109.0 160.0 126.0
CZE 13.0 10.0 18.0
DEN 152.0 147.0 192.0
... ... ... ...
SRI NaN NaN 2.0
SUD NaN NaN 1.0
SUI 138.0 73.0 165.0
SUR 1.0 1.0 NaN
SVK 8.0 10.0 11.0
SWE 325.0 347.0 349.0
SYR 1.0 1.0 1.0
TAN NaN NaN 2.0
TCH 105.0 80.0 144.0
TGA NaN NaN 1.0
THA 10.0 7.0 4.0
TJK 1.0 NaN 1.0
TOG 1.0 NaN NaN
TPE 13.0 2.0 27.0
TRI 11.0 1.0 8.0
TUN 3.0 2.0 2.0
TUR 22.0 37.0 23.0
UAE NaN 1.0 NaN
UGA 2.0 1.0 3.0
UKR 78.0 32.0 38.0
URS 584.0 838.0 627.0
URU 30.0 44.0 2.0
USA 1052.0 2088.0 1195.0
UZB 8.0 4.0 5.0
VEN 8.0 1.0 2.0
VIE NaN NaN 2.0
YUG 118.0 143.0 174.0
ZAM 1.0 NaN 1.0
ZIM 1.0 18.0 4.0
ZZX 10.0 23.0 15.0

138 rows × 3 columns

# Construct the pivot table: counted
counted = medals.pivot_table (index='NOC', columns='Medal', values='Athlete', aggfunc='count')

# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')

# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)

# Print the top 15 rows of counted
print(counted.head(15))
Medal  Bronze    Gold  Silver  totals
NOC                                  
USA    1052.0  2088.0  1195.0  4335.0
URS     584.0   838.0   627.0  2049.0
GBR     505.0   498.0   591.0  1594.0
FRA     475.0   378.0   461.0  1314.0
ITA     374.0   460.0   394.0  1228.0
GER     454.0   407.0   350.0  1211.0
AUS     413.0   293.0   369.0  1075.0
HUN     345.0   400.0   308.0  1053.0
SWE     325.0   347.0   349.0  1021.0
GDR     225.0   329.0   271.0   825.0
NED     320.0   212.0   250.0   782.0
JPN     270.0   206.0   228.0   704.0
CHN     193.0   234.0   252.0   679.0
RUS     240.0   192.0   206.0   638.0
ROU     282.0   155.0   187.0   624.0

Tags:

Updated: