It is often reported that women earn more than men. The Ontario government (in Canada) reports the salary of all public-sector employees making more than a given amount. This data is freely available.
dataset = pd.read_csv("fr-2017-pssd-compendium-20180320-utf8.csv")
dataset.head(5)
Secteur | Nom de famille | Prénom | Traitement versé | Avantages imposables | Employeur | Poste | Année civile | |
---|---|---|---|---|---|---|---|---|
0 | Autres employeurs du secteur public | Coleridge | Peter | $250,000.00 | $25,749.50 | Big Brothers Big Sisters of Canada | National President and Chief Executive Officer | 2017 |
1 | Autres employeurs du secteur public | Chater | W. Matthew | $100,877.00 | $2,778.27 | Big Brothers Big Sisters of Canada | National Vice-President, Service and Organizat... | 2017 |
2 | Autres employeurs du secteur public | Mcintyre | Mary Ellen | $113,847.58 | $19,015.00 | Centre For Spanish Speaking Peoples | Legal Director | 2017 |
3 | Autres employeurs du secteur public | Marlin | Susan | $248,068.75 | $258.48 | Clinical Trials Ontario | President and Chief Executive Officer | 2017 |
4 | Autres employeurs du secteur public | Frampton | Peter | $165,033.85 | $10,788.30 | Learning Enrichment Foundation | Executive Director | 2017 |
How many people do we have? About 130 thousands.
dataset.shape
(131741, 8)
It looks good but the file contains numbers as "$113,847.58". It is not recognized by Python as a number. So we need to remove the '$' and the ',' before we can parse it as a number. So we call the "replace" function.
dataset["salaire"]=dataset['Traitement versé'].replace( '[\$,]','', regex=True ).astype(float) + dataset['Avantages imposables'].replace( '[\$,]','', regex=True ).astype(float)
pd.options.display.float_format = '${:,.2f}'.format
salaireparposte=dataset.groupby("Poste").agg({'salaire':[np.size,np.mean]}).reset_index()
salaireparposte=salaireparposte[salaireparposte[("salaire","size")]>200].sort_values(("salaire","mean"),ascending=False)
salaireparposte.columns = [col[0]+col[1] for col in salaireparposte.columns]
salaireparposte[["Poste","salairemean"]].set_index('Poste').head(10).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10d24c240>
dataset.head(5)
Secteur | Nom de famille | Prénom | Traitement versé | Avantages imposables | Employeur | Poste | Année civile | salaire | |
---|---|---|---|---|---|---|---|---|---|
0 | Autres employeurs du secteur public | Coleridge | Peter | $250,000.00 | $25,749.50 | Big Brothers Big Sisters of Canada | National President and Chief Executive Officer | 2017 | $275,749.50 |
1 | Autres employeurs du secteur public | Chater | W. Matthew | $100,877.00 | $2,778.27 | Big Brothers Big Sisters of Canada | National Vice-President, Service and Organizat... | 2017 | $103,655.27 |
2 | Autres employeurs du secteur public | Mcintyre | Mary Ellen | $113,847.58 | $19,015.00 | Centre For Spanish Speaking Peoples | Legal Director | 2017 | $132,862.58 |
3 | Autres employeurs du secteur public | Marlin | Susan | $248,068.75 | $258.48 | Clinical Trials Ontario | President and Chief Executive Officer | 2017 | $248,327.23 |
4 | Autres employeurs du secteur public | Frampton | Peter | $165,033.85 | $10,788.30 | Learning Enrichment Foundation | Executive Director | 2017 | $175,822.15 |
salaireparposte=dataset.groupby("Poste").agg({'salaire':[np.size,np.max]}).reset_index()
salaireparposte=salaireparposte[salaireparposte[("salaire","size")]>200].sort_values(("salaire","amax"),ascending=False)
salaireparposte.columns = [col[0]+col[1] for col in salaireparposte.columns]
salaireparposte[["Poste","salaireamax"]].set_index('Poste').head(10).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10e393320>
profsalaire = dataset[dataset['Poste'].str.contains("Prof") & ~(dataset['Poste'].str.contains("Dean"))]
profsalaire.sort_values("salaire",ascending=False)[["Nom de famille","Employeur","salaire"]].head(100)
Nom de famille | Employeur | salaire | |
---|---|---|---|
126061 | Ganjavi | Laurentian University of Sudbury | $686,965.86 |
129478 | Hull | University of Toronto | $465,324.29 |
122755 | Yusuf | Mcmaster University | $462,846.17 |
128612 | Christoffersen | University of Toronto | $457,575.56 |
130294 | Mitchell | University of Toronto | $451,223.87 |
131556 | Whyte | University of Toronto | $449,973.45 |
131210 | Strange | University of Toronto | $442,910.28 |
128033 | Aivazian | University of Toronto | $441,261.26 |
130167 | Mccurdy | University of Toronto | $436,216.96 |
128022 | Agrawal | University of Toronto | $428,082.50 |
131125 | Soman | University of Toronto | $424,189.00 |
130566 | Pauly | University of Toronto | $420,790.32 |
128207 | Baum | University of Toronto | $419,325.79 |
129319 | Han | University of Toronto | $414,077.12 |
122398 | Panju | Mcmaster University | $413,113.36 |
129384 | Hejazi | University of Toronto | $408,842.06 |
130946 | Schneider | University of Toronto | $405,333.81 |
129056 | Florida | University of Toronto | $405,251.64 |
131115 | Soberman | University of Toronto | $403,905.40 |
130912 | Sargent | University of Toronto | $398,555.70 |
130838 | Rotenberg | University of Toronto | $397,091.80 |
129209 | Goldreich | University of Toronto | $396,211.12 |
121811 | Crowther | Mcmaster University | $395,275.80 |
130281 | Misak | University of Toronto | $392,167.74 |
131056 | Silverman | University of Toronto | $389,464.62 |
113669 | Woolf | Queen's University | $386,629.92 |
113936 | Levin | Queen's University | $384,121.88 |
128908 | Dyck | University of Toronto | $384,073.10 |
129129 | Gans | University of Toronto | $384,064.68 |
130414 | Naylor | University of Toronto | $381,503.40 |
... | ... | ... | ... |
51052 | Alexopoulou | Hamilton Health Sciences | $336,130.62 |
120573 | Foerster | University of Western Ontario | $335,005.50 |
122390 | Orovan | Mcmaster University | $334,696.70 |
129942 | Li | University of Toronto | $332,684.88 |
129566 | John | University of Toronto | $332,475.30 |
128666 | Collingridge | University of Toronto | $331,715.52 |
120271 | Bansal | University of Western Ontario | $331,338.76 |
113769 | Anger | Queen's University | $330,174.28 |
130328 | Moorthy | University of Toronto | $329,902.14 |
121784 | Collins | Mcmaster University | $329,506.26 |
131071 | Simutin | University of Toronto | $329,392.79 |
128343 | Bova | University of Toronto | $328,917.44 |
124575 | Madhok | York University | $328,501.08 |
122442 | Price | Mcmaster University | $327,852.73 |
128332 | Booth | University of Toronto | $327,368.44 |
120294 | Beamish | University of Western Ontario | $326,990.74 |
130406 | Najm | University of Toronto | $325,777.62 |
131348 | Tombak | University of Toronto | $325,155.89 |
130784 | Ripstein | University of Toronto | $325,116.12 |
124511 | Levesque | York University | $323,129.04 |
121273 | Sapp | University of Western Ontario | $322,940.92 |
121845 | Denburg | Mcmaster University | $322,786.03 |
129227 | Gotlieb | University of Toronto | $322,649.48 |
120811 | Kim | University of Western Ontario | $321,444.96 |
130772 | Richardson | University of Toronto | $321,395.34 |
120370 | Busaba | University of Western Ontario | $320,882.58 |
129493 | Hyatt | University of Toronto | $320,876.09 |
131170 | Stein | University of Toronto | $320,327.48 |
127366 | Elayan | Brock University | $319,818.78 |
130787 | Roach | University of Toronto | $319,360.28 |
100 rows × 3 columns
Ok. So we have the dataset, but there is no gender identification. Thankfully, there are available statistical models that allow us to categorize first names.
genderstat = pd.read_csv("us-likelihood-of-gender-by-name-in-2014.csv")
genderstat[["sex","name"]].head()
sex | name | |
---|---|---|
0 | F | Elaine |
1 | F | Cathy |
2 | F | Heidi |
3 | F | Vicki |
4 | F | Melinda |
But first, let us check that things work as they should:
from unidecode import unidecode
def gender_first_name(name):
try:
return genderstat[genderstat['name'].map(lambda x : unidecode(x))==unidecode(name)][["sex"]].values[0][0]
except:
return "UNKNOWN"
for n in ["Nathalie", "Julie", "Pierre", "Yves", "Daniel", "Lohan"]:
print(n, gender_first_name(n))
Nathalie F Julie F Pierre M Yves M Daniel M Lohan UNKNOWN
Ok. So now we have two tables. What are we going to do to put the data back together? One solution is to use a "join".
We have a 'name' column in one dataset, and now we will create another name column on the other data set, after making sure to trim out accents so that "Éric" becomes "Eric".
from unidecode import unidecode
dataset['name'] = dataset['Prénom'].apply(lambda n : unidecode(n) )
dataset[['name','Prénom']].head(10)
name | Prénom | |
---|---|---|
0 | Peter | Peter |
1 | W. Matthew | W. Matthew |
2 | Mary Ellen | Mary Ellen |
3 | Susan | Susan |
4 | Peter | Peter |
5 | Therese | Therese |
6 | Daniel | Daniel |
7 | Deborah | Deborah |
8 | Shirley | Shirley |
9 | Sue | Sue |
datasetwithgender=pd.merge(dataset,genderstat,on="name")
datasetwithgender.groupby("sex").size()
sex F 43315 M 58417 dtype: int64
datasetwithgender[datasetwithgender["Poste"]=="Physician"].groupby(["sex"]).agg({'salaire':[np.mean,np.max,np.median]}).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x111d69b00>
datasetwithgender.groupby("sex").agg({'salaire':np.mean})
salaire | |
---|---|
sex | |
F | $123,879.68 |
M | $130,332.61 |
datasetwithgender.groupby("sex").agg({'salaire':np.max})
salaire | |
---|---|
sex | |
F | $745,268.93 |
M | $1,561,320.95 |
waterlooprofessor = datasetwithgender[(datasetwithgender["Employeur"]=="University of Waterloo")
&(datasetwithgender["Poste"].str.contains("Professor"))]
waterlooprofessor[waterlooprofessor["sex"]=="M"]["salaire"].plot.hist( bins=20)
waterlooprofessor[waterlooprofessor["sex"]=="F"]["salaire"].plot.hist( bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x111713668>
waterlooprofessor.groupby("sex").agg({'salaire':np.max}).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1a1eed39b0>
waterlooprofessor.groupby("sex").agg({'name':np.size}).plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f4373c8>
ottawaprofessor = datasetwithgender[(datasetwithgender["Employeur"]=="University of Ottawa")
&(datasetwithgender["Poste"].str.contains("Professeur"))]
ottawaprofessor[['name','sex',"salaire"]].head(5)
name | sex | salaire | |
---|---|---|---|
796 | Peter | M | $185,750.40 |
797 | Peter | M | $153,990.13 |
798 | Peter | M | $153,903.36 |
800 | Peter | M | $179,153.60 |
801 | Peter | M | $185,766.00 |
ottawaprofessor.groupby("sex").agg({'salaire':np.mean})
salaire | |
---|---|
sex | |
F | $147,084.86 |
M | $156,445.92 |
ottawaprofessor.groupby("sex").agg({'salaire':np.max})
salaire | |
---|---|
sex | |
F | $237,641.00 |
M | $260,765.92 |
ottawaprofessor[ottawaprofessor["sex"]=="M"]["salaire"].plot.hist( bins=20)
ottawaprofessor[ottawaprofessor["sex"]=="F"]["salaire"].plot.hist( bins=20)
<matplotlib.axes._subplots.AxesSubplot at 0x1a1f030470>