Manipulation DataFrames using Pandas
Grouping and Aggregating dataframes
The Olympic medal data for the following exercises comes from https://assets.datacamp.com/production/course_1650/datasets/all_medalists.csv . It comprises records of all events held at the Olympic games between 1896 and 2012. Suppose you have loaded the data into a DataFrame medals. You now want to find the total number of medals awarded to the USA per edition.
import pandas as pd
medals = pd.read_csv('https://assets.datacamp.com/production/course_1650/datasets/all_medalists.csv')
medals.head()
City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |
medals.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 29216 entries, 0 to 29215
Data columns (total 10 columns):
City 29216 non-null object
Edition 29216 non-null int64
Sport 29216 non-null object
Discipline 29216 non-null object
Athlete 29216 non-null object
NOC 29216 non-null object
Gender 29216 non-null object
Event 29216 non-null object
Event_gender 29216 non-null object
Medal 29216 non-null object
dtypes: int64(1), object(9)
memory usage: 2.2+ MB
The columns NOC
represents the country.
medals.loc[medals.NOC == "USA"].groupby('Edition')['Medal'].count()
Edition
1896 20
1900 55
1904 394
1908 63
1912 101
1920 193
1924 198
1928 84
1932 181
1936 92
1948 148
1952 130
1956 118
1960 112
1964 150
1968 149
1972 155
1976 155
1984 333
1988 193
1992 224
1996 260
2000 248
2004 264
2008 315
Name: Medal, dtype: int64
What are the top 15 countries ranked by total number of medals ?
For this, we can use the pandas Series method .value_counts()
to determine the top 15 countries ranked by total number of medals.
# Select the 'NOC' column of medals: country_names
country_names = medals['NOC']
# Count the number of medals won by each country: medal_counts
medal_counts = country_names.value_counts()
# Print top 15 countries ranked by medals
print(medal_counts.head(15))
USA 4335
URS 2049
GBR 1594
FRA 1314
ITA 1228
GER 1211
AUS 1075
HUN 1053
SWE 1021
GDR 825
NED 782
JPN 704
CHN 679
RUS 638
ROU 624
Name: NOC, dtype: int64
Using .pivot_table()
to count medals by type
Let’s see how using .pivot_table()
helps in summarizing the data. Take a look at the head().
medals.head()
City | Edition | Sport | Discipline | Athlete | NOC | Gender | Event | Event_gender | Medal | |
---|---|---|---|---|---|---|---|---|---|---|
0 | Athens | 1896 | Aquatics | Swimming | HAJOS, Alfred | HUN | Men | 100m freestyle | M | Gold |
1 | Athens | 1896 | Aquatics | Swimming | HERSCHMANN, Otto | AUT | Men | 100m freestyle | M | Silver |
2 | Athens | 1896 | Aquatics | Swimming | DRIVAS, Dimitrios | GRE | Men | 100m freestyle for sailors | M | Bronze |
3 | Athens | 1896 | Aquatics | Swimming | MALOKINIS, Ioannis | GRE | Men | 100m freestyle for sailors | M | Gold |
4 | Athens | 1896 | Aquatics | Swimming | CHASAPIS, Spiridon | GRE | Men | 100m freestyle for sailors | M | Silver |
pivot_table
is used when you want to see how a particular column relates to another column. For example, here we want to see how each country (NOC
) column relates to the Medal
column. Another example is if you want to know how the medals are distributed amoung men and women for each country, you can do this:
medals.pivot_table(index='NOC', columns='Gender', values = 'Athlete', aggfunc='count')
Gender | Men | Women |
---|---|---|
NOC | ||
AFG | 1.0 | NaN |
AHO | 1.0 | NaN |
ALG | 11.0 | 3.0 |
ANZ | 27.0 | 2.0 |
ARG | 183.0 | 56.0 |
ARM | 9.0 | NaN |
AUS | 647.0 | 428.0 |
AUT | 125.0 | 21.0 |
AZE | 12.0 | 4.0 |
BAH | 10.0 | 13.0 |
BAR | 1.0 | NaN |
BDI | 1.0 | NaN |
BEL | 390.0 | 18.0 |
BER | 1.0 | NaN |
BLR | 45.0 | 47.0 |
BOH | 6.0 | 1.0 |
BRA | 262.0 | 110.0 |
BUL | 214.0 | 117.0 |
BWI | 5.0 | NaN |
CAN | 411.0 | 181.0 |
CHI | 32.0 | 1.0 |
CHN | 218.0 | 461.0 |
CIV | 1.0 | NaN |
CMR | 20.0 | 2.0 |
COL | 6.0 | 5.0 |
CRC | NaN | 4.0 |
CRO | 75.0 | 4.0 |
CUB | 299.0 | 96.0 |
CZE | 29.0 | 12.0 |
DEN | 402.0 | 89.0 |
... | ... | ... |
SRI | 1.0 | 1.0 |
SUD | 1.0 | NaN |
SUI | 350.0 | 26.0 |
SUR | 2.0 | NaN |
SVK | 24.0 | 5.0 |
SWE | 932.0 | 89.0 |
SYR | 2.0 | 1.0 |
TAN | 2.0 | NaN |
TCH | 249.0 | 80.0 |
TGA | 1.0 | NaN |
THA | 13.0 | 8.0 |
TJK | 2.0 | NaN |
TOG | 1.0 | NaN |
TPE | 30.0 | 12.0 |
TRI | 20.0 | NaN |
TUN | 7.0 | NaN |
TUR | 75.0 | 7.0 |
UAE | 1.0 | NaN |
UGA | 6.0 | NaN |
UKR | 75.0 | 73.0 |
URS | 1476.0 | 573.0 |
URU | 76.0 | NaN |
USA | 3120.0 | 1215.0 |
UZB | 16.0 | 1.0 |
VEN | 9.0 | 2.0 |
VIE | 1.0 | 1.0 |
YUG | 373.0 | 62.0 |
ZAM | 2.0 | NaN |
ZIM | NaN | 23.0 |
ZZX | 45.0 | 3.0 |
138 rows × 2 columns
Let’s say you are now interested in finding out which countries have women won as many or more medals than men To answer this question, we need to take the above output and then add another boolean Series/column which is True for a country where women won more medals than men. Then, use that column to filter the dataframe to find the countries where women won more medals.
genderCounts = medals.pivot_table(index='NOC', columns='Gender', values = 'Athlete', aggfunc='count')
genderCounts.loc[genderCounts['Women'] >= genderCounts['Men']]
Gender | Men | Women |
---|---|---|
NOC | ||
BAH | 10.0 | 13.0 |
BLR | 45.0 | 47.0 |
CHN | 218.0 | 461.0 |
IOP | 1.0 | 2.0 |
JAM | 47.0 | 55.0 |
PER | 3.0 | 12.0 |
ROU | 299.0 | 325.0 |
SIN | 1.0 | 3.0 |
SRI | 1.0 | 1.0 |
VIE | 1.0 | 1.0 |
While this is true for the countries listed above, there are some countries which have NAN
values, example ZIM
. These are lost during comparison. Hence, we must first replace NAN
values from our genderCounts
dataframe with 0’s.
genderCounts.fillna(value=0, inplace=True)
Now, if we run our comparison, we can get the countries where women won as many or more medals than men.
genderCounts.loc[genderCounts['Women'] >= genderCounts['Men']]
Gender | Men | Women |
---|---|---|
NOC | ||
BAH | 10.0 | 13.0 |
BLR | 45.0 | 47.0 |
CHN | 218.0 | 461.0 |
CRC | 0.0 | 4.0 |
IOP | 1.0 | 2.0 |
JAM | 47.0 | 55.0 |
MOZ | 0.0 | 2.0 |
PER | 3.0 | 12.0 |
ROU | 299.0 | 325.0 |
SIN | 1.0 | 3.0 |
SRI | 1.0 | 1.0 |
VIE | 1.0 | 1.0 |
ZIM | 0.0 | 23.0 |
Now, we see that there are 3 more countries added to the list, namely, ZIM
, MOZ
and CRC
.
Next, I am interested in knowing in which countries women won more medals than men ?
medals.pivot_table(index='NOC', columns=['Gender', 'Medal'], values = 'Athlete', aggfunc='count')
Gender | Men | Women | ||||
---|---|---|---|---|---|---|
Medal | Bronze | Gold | Silver | Bronze | Gold | Silver |
NOC | ||||||
AFG | 1.0 | NaN | NaN | NaN | NaN | NaN |
AHO | NaN | NaN | 1.0 | NaN | NaN | NaN |
ALG | 7.0 | 2.0 | 2.0 | 1.0 | 2.0 | NaN |
ANZ | 5.0 | 19.0 | 3.0 | NaN | 1.0 | 1.0 |
ARG | 51.0 | 68.0 | 64.0 | 37.0 | NaN | 19.0 |
ARM | 7.0 | 1.0 | 1.0 | NaN | NaN | NaN |
AUS | 263.0 | 148.0 | 236.0 | 150.0 | 145.0 | 133.0 |
AUT | 31.0 | 17.0 | 77.0 | 13.0 | 4.0 | 4.0 |
AZE | 6.0 | 3.0 | 3.0 | 3.0 | 1.0 | NaN |
BAH | 4.0 | 2.0 | 4.0 | 1.0 | 7.0 | 5.0 |
BAR | 1.0 | NaN | NaN | NaN | NaN | NaN |
BDI | NaN | 1.0 | NaN | NaN | NaN | NaN |
BEL | 141.0 | 88.0 | 161.0 | 9.0 | 3.0 | 6.0 |
BER | 1.0 | NaN | NaN | NaN | NaN | NaN |
BLR | 24.0 | 8.0 | 13.0 | 29.0 | 6.0 | 12.0 |
BOH | 5.0 | NaN | 1.0 | 1.0 | NaN | NaN |
BRA | 132.0 | 44.0 | 86.0 | 42.0 | 15.0 | 53.0 |
BUL | 80.0 | 40.0 | 94.0 | 56.0 | 13.0 | 48.0 |
BWI | 5.0 | NaN | NaN | NaN | NaN | NaN |
CAN | 144.0 | 122.0 | 145.0 | 83.0 | 32.0 | 66.0 |
CHI | 21.0 | 3.0 | 8.0 | NaN | NaN | 1.0 |
CHN | 48.0 | 95.0 | 75.0 | 145.0 | 139.0 | 177.0 |
CIV | NaN | NaN | 1.0 | NaN | NaN | NaN |
CMR | 1.0 | 18.0 | 1.0 | NaN | 2.0 | NaN |
COL | 3.0 | NaN | 3.0 | 4.0 | 1.0 | NaN |
CRC | NaN | NaN | NaN | 2.0 | 1.0 | 1.0 |
CRO | 15.0 | 31.0 | 29.0 | 3.0 | NaN | 1.0 |
CUB | 76.0 | 116.0 | 107.0 | 33.0 | 44.0 | 19.0 |
CZE | 10.0 | 6.0 | 13.0 | 3.0 | 4.0 | 5.0 |
DEN | 130.0 | 95.0 | 177.0 | 22.0 | 52.0 | 15.0 |
... | ... | ... | ... | ... | ... | ... |
SRI | NaN | NaN | 1.0 | NaN | NaN | 1.0 |
SUD | NaN | NaN | 1.0 | NaN | NaN | NaN |
SUI | 130.0 | 71.0 | 149.0 | 8.0 | 2.0 | 16.0 |
SUR | 1.0 | 1.0 | NaN | NaN | NaN | NaN |
SVK | 8.0 | 8.0 | 8.0 | NaN | 2.0 | 3.0 |
SWE | 284.0 | 328.0 | 320.0 | 41.0 | 19.0 | 29.0 |
SYR | 1.0 | NaN | 1.0 | NaN | 1.0 | NaN |
TAN | NaN | NaN | 2.0 | NaN | NaN | NaN |
TCH | 94.0 | 60.0 | 95.0 | 11.0 | 20.0 | 49.0 |
TGA | NaN | NaN | 1.0 | NaN | NaN | NaN |
THA | 6.0 | 4.0 | 3.0 | 4.0 | 3.0 | 1.0 |
TJK | 1.0 | NaN | 1.0 | NaN | NaN | NaN |
TOG | 1.0 | NaN | NaN | NaN | NaN | NaN |
TPE | 4.0 | 1.0 | 25.0 | 9.0 | 1.0 | 2.0 |
TRI | 11.0 | 1.0 | 8.0 | NaN | NaN | NaN |
TUN | 3.0 | 2.0 | 2.0 | NaN | NaN | NaN |
TUR | 20.0 | 36.0 | 19.0 | 2.0 | 1.0 | 4.0 |
UAE | NaN | 1.0 | NaN | NaN | NaN | NaN |
UGA | 2.0 | 1.0 | 3.0 | NaN | NaN | NaN |
UKR | 38.0 | 15.0 | 22.0 | 40.0 | 17.0 | 16.0 |
URS | 419.0 | 575.0 | 482.0 | 165.0 | 263.0 | 145.0 |
URU | 30.0 | 44.0 | 2.0 | NaN | NaN | NaN |
USA | 766.0 | 1520.0 | 834.0 | 286.0 | 568.0 | 361.0 |
UZB | 7.0 | 4.0 | 5.0 | 1.0 | NaN | NaN |
VEN | 6.0 | 1.0 | 2.0 | 2.0 | NaN | NaN |
VIE | NaN | NaN | 1.0 | NaN | NaN | 1.0 |
YUG | 102.0 | 125.0 | 146.0 | 16.0 | 18.0 | 28.0 |
ZAM | 1.0 | NaN | 1.0 | NaN | NaN | NaN |
ZIM | NaN | NaN | NaN | 1.0 | 18.0 | 4.0 |
ZZX | 8.0 | 23.0 | 14.0 | 2.0 | NaN | 1.0 |
138 rows × 6 columns
genderGold = medals.pivot_table(index='NOC', columns=['Gender', 'Medal'], values = 'Athlete', aggfunc='count')
Here, the goal is the access the Multi-Index column and select only Gold column of the second level. This can be done as follows:
genderGold.loc[:, (['Men', 'Women'], 'Gold')]
# or using slice(None) to specify all columns.
#genderGold.loc[:, (slice(None), 'Gold')]
Gender | Men | Women |
---|---|---|
Medal | Gold | Gold |
NOC | ||
AFG | NaN | NaN |
AHO | NaN | NaN |
ALG | 2.0 | 2.0 |
ANZ | 19.0 | 1.0 |
ARG | 68.0 | NaN |
ARM | 1.0 | NaN |
AUS | 148.0 | 145.0 |
AUT | 17.0 | 4.0 |
AZE | 3.0 | 1.0 |
BAH | 2.0 | 7.0 |
BAR | NaN | NaN |
BDI | 1.0 | NaN |
BEL | 88.0 | 3.0 |
BER | NaN | NaN |
BLR | 8.0 | 6.0 |
BOH | NaN | NaN |
BRA | 44.0 | 15.0 |
BUL | 40.0 | 13.0 |
BWI | NaN | NaN |
CAN | 122.0 | 32.0 |
CHI | 3.0 | NaN |
CHN | 95.0 | 139.0 |
CIV | NaN | NaN |
CMR | 18.0 | 2.0 |
COL | NaN | 1.0 |
CRC | NaN | 1.0 |
CRO | 31.0 | NaN |
CUB | 116.0 | 44.0 |
CZE | 6.0 | 4.0 |
DEN | 95.0 | 52.0 |
... | ... | ... |
SRI | NaN | NaN |
SUD | NaN | NaN |
SUI | 71.0 | 2.0 |
SUR | 1.0 | NaN |
SVK | 8.0 | 2.0 |
SWE | 328.0 | 19.0 |
SYR | NaN | 1.0 |
TAN | NaN | NaN |
TCH | 60.0 | 20.0 |
TGA | NaN | NaN |
THA | 4.0 | 3.0 |
TJK | NaN | NaN |
TOG | NaN | NaN |
TPE | 1.0 | 1.0 |
TRI | 1.0 | NaN |
TUN | 2.0 | NaN |
TUR | 36.0 | 1.0 |
UAE | 1.0 | NaN |
UGA | 1.0 | NaN |
UKR | 15.0 | 17.0 |
URS | 575.0 | 263.0 |
URU | 44.0 | NaN |
USA | 1520.0 | 568.0 |
UZB | 4.0 | NaN |
VEN | 1.0 | NaN |
VIE | NaN | NaN |
YUG | 125.0 | 18.0 |
ZAM | NaN | NaN |
ZIM | NaN | 18.0 |
ZZX | 23.0 | NaN |
138 rows × 2 columns
# store this into a dataframe
genderGold = genderGold.loc[:, (['Men', 'Women'], 'Gold')]
# Replace NaNs with 0
genderGold.fillna(value=0, inplace=True)
genderGold.head()
Gender | Men | Women |
---|---|---|
Medal | Gold | Gold |
NOC | ||
AFG | 0.0 | 0.0 |
AHO | 0.0 | 0.0 |
ALG | 2.0 | 2.0 |
ANZ | 19.0 | 1.0 |
ARG | 68.0 | 0.0 |
# Filter rows which have the desired criteria
# For the sake of simplicity, create a boolean Series
womenGold = genderGold.loc[:,(['Women'], 'Gold')]
womenGold.info()
<class 'pandas.core.frame.DataFrame'>
Index: 138 entries, AFG to ZZX
Data columns (total 1 columns):
(Women, Gold) 138 non-null float64
dtypes: float64(1)
memory usage: 2.2+ KB
# Filter rows which have the desired criteria
# For the sake of simplicity, create a boolean Series
# genderGold.loc[:, (['Women'], 'Gold')] >= genderGold.loc[:, (['Men'], 'Gold')]
medals.pivot_table(index='NOC', columns='Medal', values= 'Athlete', aggfunc='count')
Medal | Bronze | Gold | Silver |
---|---|---|---|
NOC | |||
AFG | 1.0 | NaN | NaN |
AHO | NaN | NaN | 1.0 |
ALG | 8.0 | 4.0 | 2.0 |
ANZ | 5.0 | 20.0 | 4.0 |
ARG | 88.0 | 68.0 | 83.0 |
ARM | 7.0 | 1.0 | 1.0 |
AUS | 413.0 | 293.0 | 369.0 |
AUT | 44.0 | 21.0 | 81.0 |
AZE | 9.0 | 4.0 | 3.0 |
BAH | 5.0 | 9.0 | 9.0 |
BAR | 1.0 | NaN | NaN |
BDI | NaN | 1.0 | NaN |
BEL | 150.0 | 91.0 | 167.0 |
BER | 1.0 | NaN | NaN |
BLR | 53.0 | 14.0 | 25.0 |
BOH | 6.0 | NaN | 1.0 |
BRA | 174.0 | 59.0 | 139.0 |
BUL | 136.0 | 53.0 | 142.0 |
BWI | 5.0 | NaN | NaN |
CAN | 227.0 | 154.0 | 211.0 |
CHI | 21.0 | 3.0 | 9.0 |
CHN | 193.0 | 234.0 | 252.0 |
CIV | NaN | NaN | 1.0 |
CMR | 1.0 | 20.0 | 1.0 |
COL | 7.0 | 1.0 | 3.0 |
CRC | 2.0 | 1.0 | 1.0 |
CRO | 18.0 | 31.0 | 30.0 |
CUB | 109.0 | 160.0 | 126.0 |
CZE | 13.0 | 10.0 | 18.0 |
DEN | 152.0 | 147.0 | 192.0 |
... | ... | ... | ... |
SRI | NaN | NaN | 2.0 |
SUD | NaN | NaN | 1.0 |
SUI | 138.0 | 73.0 | 165.0 |
SUR | 1.0 | 1.0 | NaN |
SVK | 8.0 | 10.0 | 11.0 |
SWE | 325.0 | 347.0 | 349.0 |
SYR | 1.0 | 1.0 | 1.0 |
TAN | NaN | NaN | 2.0 |
TCH | 105.0 | 80.0 | 144.0 |
TGA | NaN | NaN | 1.0 |
THA | 10.0 | 7.0 | 4.0 |
TJK | 1.0 | NaN | 1.0 |
TOG | 1.0 | NaN | NaN |
TPE | 13.0 | 2.0 | 27.0 |
TRI | 11.0 | 1.0 | 8.0 |
TUN | 3.0 | 2.0 | 2.0 |
TUR | 22.0 | 37.0 | 23.0 |
UAE | NaN | 1.0 | NaN |
UGA | 2.0 | 1.0 | 3.0 |
UKR | 78.0 | 32.0 | 38.0 |
URS | 584.0 | 838.0 | 627.0 |
URU | 30.0 | 44.0 | 2.0 |
USA | 1052.0 | 2088.0 | 1195.0 |
UZB | 8.0 | 4.0 | 5.0 |
VEN | 8.0 | 1.0 | 2.0 |
VIE | NaN | NaN | 2.0 |
YUG | 118.0 | 143.0 | 174.0 |
ZAM | 1.0 | NaN | 1.0 |
ZIM | 1.0 | 18.0 | 4.0 |
ZZX | 10.0 | 23.0 | 15.0 |
138 rows × 3 columns
# Construct the pivot table: counted
counted = medals.pivot_table (index='NOC', columns='Medal', values='Athlete', aggfunc='count')
# Create the new column: counted['totals']
counted['totals'] = counted.sum(axis='columns')
# Sort counted by the 'totals' column
counted = counted.sort_values('totals', ascending=False)
# Print the top 15 rows of counted
print(counted.head(15))
Medal Bronze Gold Silver totals
NOC
USA 1052.0 2088.0 1195.0 4335.0
URS 584.0 838.0 627.0 2049.0
GBR 505.0 498.0 591.0 1594.0
FRA 475.0 378.0 461.0 1314.0
ITA 374.0 460.0 394.0 1228.0
GER 454.0 407.0 350.0 1211.0
AUS 413.0 293.0 369.0 1075.0
HUN 345.0 400.0 308.0 1053.0
SWE 325.0 347.0 349.0 1021.0
GDR 225.0 329.0 271.0 825.0
NED 320.0 212.0 250.0 782.0
JPN 270.0 206.0 228.0 704.0
CHN 193.0 234.0 252.0 679.0
RUS 240.0 192.0 206.0 638.0
ROU 282.0 155.0 187.0 624.0