This is my attempt at the City of LA data analysis competition on Kaggle, using the public kernels as a helping hand.
The goal is to convert a folder full of plain-text job postings into a single structured CSV file and then to use this data to: (1) identify language that can negatively bias the pool of applicants; (2) improve the diversity and quality of the applicant pool; and/or (3) make it easier to determine which promotions are available to employees in each job class.
You can download the bulletins and additional files from the Kaggle competition page for the City of LA.
This analysis is performed using Python, so to run it on your system you will need a copy of Python installed. I recommend using an IDE such as PyCharm.
The libraries required are:
pandas: for the dataframes
IPython: to display the dataframes
re: for the regex templates
numpy: for the NaN constant
datetime: to extract dates from the text files
matplotlib: for plotting
nltk: for the word-processing analytics
seaborn: for plotting
collections: for the Counter class
scipy: for calculating linear regressions
os: to list the bulletin text file names
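Putting these together, the imports at the top of the script would look something like this (a minimal sketch; the nltk downloads are one-time setup):
import os
import re
from datetime import datetime
from collections import Counter

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import nltk
from scipy.stats import linregress
from IPython.display import display

# one-time nltk resources used below for tokenizing and POS tagging
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')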
First, to convert the folder of plain-text job postings, we need to identify patterns present in these text files so we can extract meaningful data using regex templates. Going through some of the files we can see recurring traits: all the headings are in capital letters and appear in the same order in every job bulletin, so we can extract these headings and use them as columns in our CSV, and the open date (the date the job was posted) is written in the same place in each bulletin. Using these traits we can write a function, to_dataframe, to extract all the meaningful data from the text files.
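As a quick illustration of the approach, the open-date pattern can be tested on a hypothetical line written in the style of the bulletins:
opendate = re.compile(r'(Open [Dd]ate:)(\s+)(\d\d-\d\d-\d\d)')
line = 'Open Date:     05-11-18'   # hypothetical bulletin line
print(opendate.search(line).group(3))  # prints 05-11-18
The full extraction function follows: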
def to_dataframe(path):
    """Extract all meaningful features from the job bulletin text files and
    convert them to a pandas dataframe.

    Takes one argument: a string containing the path of the bulletin text files.
    """
    df = pd.DataFrame()  # accumulator for the extracted rows
    bulletins = os.listdir(path)
    opendate = re.compile(r'(Open [Dd]ate:)(\s+)(\d\d-\d\d-\d\d)')  # match open date
    salary = re.compile(r'(\$\d+,\d+)((\s(to|and)\s)(\$\d+,\d+))?((\s(and)\s)(\$\d+,\d+)((\s(to|and)\s)(\$\d+,\d+)))?')  # match salary
    requirements = re.compile(r'(REQUIREMENTS?/\s?MINIMUM QUALIFICATIONS?)(.*)(PROCESS NOTE)')  # match requirements
    for no in range(len(bulletins)):
        with open(path + bulletins[no]) as f:  # loop through, reading each file
            try:
                file = f.read().replace('\t', '')  # remove tabs for heading extraction
                data = file.replace('\n', '')  # remove line breaks for content matching
                headings = [heading for heading in file.split('\n') if heading.isupper()]  # headings are the all-caps lines
                try:
                    date = datetime.strptime(re.search(opendate, data).group(3), '%m-%d-%y')  # parse open date
                except Exception:
                    print(no)  # error diagnosis
                try:
                    sal = re.search(salary, data)
                except Exception:
                    print(bulletins[no])
                try:
                    req = re.search(requirements, data).group(2)
                except Exception:
                    req = re.search('(.*)NOTES?', re.findall(r'(REQUIREMENTS?)(.*)(NOTES?)', data)[0][1][:1200]).group(1)  # fallback requirement extraction
                duties = re.search(r'(DUTIES)(.*)(REQ[A-Z])', data).group(2)  # match duties
                try:
                    enddate = re.search(
                        r'(JANUARY|FEBRUARY|MARCH|APRIL|MAY|JUNE|JULY|AUGUST|SEPTEMBER|OCTOBER|NOVEMBER|DECEMBER)\s(\d{1,2},\s\d{4})',
                        data).group()  # best-formulated end-date match
                except Exception:
                    enddate = np.nan
                selection = [z[0] for z in re.findall(r'([A-Z][a-z]+)((\s\.\s)+)', data)]  # match selection criteria
                saler = np.nan if sal.group(9) is None else str(sal.group(9))  # alternative starting salary, if present
                saler2 = np.nan if sal.group(13) is None else str(sal.group(13))  # alternative ending salary, if present
                # note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat
                df = df.append(
                    {'File Name': bulletins[no], 'Position': headings[0].lower(),
                     'salary_start': sal.group(1), 'salary_start_alt': saler,
                     'salary_end': sal.group(5), 'salary_end_alt': saler2,
                     'opendate': date, 'requirements': req, 'duties': duties,
                     'deadline': enddate, 'selection': selection}, ignore_index=True)
            except Exception:
                print('unmatched sequence ' + bulletins[no])
    reg = re.compile(r'(One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten|one|two|three|four)\s(years?)\s(of\sfull(-|\s)time)')  # best-formulated experience match
    df['EXPERIENCE_LENGTH'] = df['requirements'].apply(
        lambda x: re.search(reg, x).group(1) if re.search(reg, x) is not None else np.nan)
    df['FULL_TIME_PART_TIME'] = df['EXPERIENCE_LENGTH'].apply(
        lambda x: 'FULL_TIME' if x is not np.nan else np.nan)
    reg = re.compile(r'(One|Two|Three|Four|Five|Six|Seven|Eight|Nine|Ten|one|two|three|four)(\s|-)(years?)\s(college)')  # best education-years match
    df['EDUCATION_YEARS'] = df['requirements'].apply(
        lambda x: re.search(reg, x).group(1) if re.search(reg, x) is not None else np.nan)
    df['SCHOOL_TYPE'] = df['EDUCATION_YEARS'].apply(
        lambda x: 'College or University' if x is not np.nan else np.nan)
    reg2 = re.compile(
        r"(semester|Semester|quarter units|college|College|university|University|Bachelor's|bachelor's|Bachelor|bachelor|Master's|master's)")  # second criterion for school type
    for index, row in df.iterrows():
        if row['SCHOOL_TYPE'] is np.nan and re.search(reg2, row['requirements']) is not None:
            df.loc[index, 'SCHOOL_TYPE'] = 'College or University'
    return df
An example of executing this function would be:
path="C:/Users/twist/CityofLA/Job Bulletins/"
df=to_dataframe(path)
df.to_csv('jobs_output.csv')#output to csv
We can then view the CSV file; as can be seen, most of the data from the job bulletins has been captured. By printing the shape of the dataframe we can get the exact number of entries:
print(df.shape)
(665, 15)
665 out of the initial 683 bulletins are extracted; now we can perform analysis on them.
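To see which 18 bulletins failed to parse, a quick check (assuming the same path as above) is to diff the folder listing against the extracted file names:
unmatched = set(os.listdir(path)) - set(df['File Name'])
print(len(unmatched), 'bulletins could not be parsed')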
By analyzing recurring words in the job titles we can make out the different job sectors available in the City of LA. We can then plot the top 10 to figure out the most dominant.
plt.figure(figsize=(8,5))
text = ' '.join(job for job in df['Position'])  # join titles into one text blob (with spaces, so words don't merge)
text = nltk.tokenize.word_tokenize(text)
job_class_final = list()  # will hold a (word, count) pair for each detected sector
checker = text.copy()
for b in range(20):
    jobs = Counter(checker)
    jobs_class = [job for job in jobs.most_common() if len(job[0]) > 3]  # most common words, ignoring short tokens
    job_class_final.append(jobs_class[0])
    black = list()
    for y in checker:
        if y.find(jobs_class[0][0]) == -1:  # keep tokens that don't match the current sector word
            black.append(y)  # rerun the count on the remainder
    checker = black
a, b = map(list, zip(*job_class_final))  # split the (word, count) tuples into two lists
sns.barplot(b, a, palette='rocket')  # create barplot
plt.title('Job sectors')
plt.xlabel('count')
plt.ylabel('sector')
plt.show()
As seen from the bar plot, the most dominant sectors are engineering and services.
To analyze the salary statistics we must first convert the values into pure numbers, removing the commas and the dollar signs:
df['salary_start'] = [sal.replace('$', '') if sal is not None else 0 for sal in df['salary_start']]
df['salary_start'] = [int(sal.replace(',', '')) if type(sal) != int else 0 for sal in df['salary_start']]
df['salary_end'] = [sal.replace('$', '') if sal is not None else 0 for sal in df['salary_end']]
df['salary_end'] = [int(sal.replace(',', '')) if type(sal) != int else 0 for sal in df['salary_end']]
# the alternative salary columns contain NaN, so cast to str first and skip the 'nan' strings
df['salary_start_alt'] = [str(sal).replace('$', '') if sal is not None else 0 for sal in df['salary_start_alt']]
df['salary_start_alt'] = [int(sal.replace(',', '')) if type(sal) != int and sal != 'nan' else 0 for sal in df['salary_start_alt']]
df['salary_end_alt'] = [str(sal).replace('$', '') if sal is not None else 0 for sal in df['salary_end_alt']]
df['salary_end_alt'] = [int(sal.replace(',', '')) if type(sal) != int and sal != 'nan' else 0 for sal in df['salary_end_alt']]
We can then plot the distribution of starting salaries across all the jobs to see how they vary:
plt.figure(figsize=(7,5))
sns.distplot(df['salary_start'])
plt.title('salary distribution')
plt.show()
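A quick numeric check on the same distribution (the summary statistics should match what the plot shows):
print(df['salary_start'].describe())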
As seen from the plot, the salaries vary from about $40k to $150k with the average around $80k. We can also check for the highest-paid jobs and the jobs that have the highest deviation from starting to ending salary:
most_paid=df[['Position','salary_start']].sort_values(by='salary_start',ascending=False)[:10]
plt.figure(figsize=(7,5))
sns.barplot(y=most_paid['Position'],x=most_paid['salary_start'],palette='rocket')
plt.title('Best paid jobs in LA')
plt.show()
The largest job market may be engineering, but the highest-paid jobs come from the services sector and the shipping and port sector.
test = pd.DataFrame(columns=['salary_diff', 'salary_diff_alt'])
test['salary_diff_alt'] = abs(df['salary_start_alt'] - df['salary_end_alt'])
test['salary_diff'] = abs(df['salary_start'] - df['salary_end'])
# take the larger of the two ranges for each job
df['salary_diff'] = [row['salary_diff'] if row['salary_diff'] > row['salary_diff_alt'] else row['salary_diff_alt'] for index, row in test.iterrows()]
ranges=df[['Position','salary_diff']].sort_values(by='salary_diff',ascending=False)[:10]
plt.figure(figsize=(7,5))
sns.barplot(y=ranges['Position'],x=ranges['salary_diff'],palette='RdBu')
plt.title("Highest deviation")
plt.show()
The mechanical repair general supervisor job has the highest deviation. It is clear from this that a supervisor position offers the most salary growth without the need for a promotion.
We can check the fluctuation in the job market over the years by analyzing the open dates of the job listings:
df['year_of_open'] = [date.year for date in df['opendate']]
count = df['year_of_open'].value_counts(ascending=True)
years = ['2020', '2019', '2018', '2017', '2016', '2015', '2014', '2013', '2012', '2008', '2006',
         '2005', '2002', '1999']
plt.figure(figsize=(7,5))
plt.plot([z for z in reversed(years)], count.values, color='blue')
plt.title('Opportunities over the years')
plt.xlabel('years')
plt.ylabel('count')
plt.gca().set_xticklabels([z for z in reversed(years)], rotation='45')
plt.show()
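Note that this pairs the value_counts output with a hand-maintained years list, which only works because the counts happen to rise with the year. A more robust sketch would sort by year directly:
count = df['year_of_open'].value_counts().sort_index()  # counts ordered by year, not by frequency
plt.plot(count.index.astype(str), count.values, color='blue')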
The opportunities seem to be rising as a whole. We can also check how the top 10 sectors have changed individually over the years:
years2 = df[['Position', 'year_of_open']].copy()
years2 = years2.sort_values(by='year_of_open', ascending=False)
for index, row in years2.iterrows():
    for x, y in job_class_final:
        if row['Position'].find(x) != -1:  # tag the posting with the first sector word it contains
            years2.loc[index, 'sector'] = x
            break
    else:  # no sector word matched
        years2.loc[index, 'sector'] = np.nan
plt.figure(figsize=(7, 5))
for x, y in job_class_final[:10]:  # one line per sector
    plotter = years2[years2['sector'] == x]
    plotter = plotter.groupby(['year_of_open']).size().reset_index(name='Size')
    counts = dict(zip(plotter['year_of_open'].astype(int), plotter['Size']))
    counter = [counts.get(int(ye), 0) for ye in years]  # postings per year, 0 where absent
    plt.plot([z for z in reversed(years)], counter[::-1], label=x)
plt.title('Opportunities over the years')
plt.xlabel('years')
plt.ylabel('count')
plt.gca().set_xticklabels([z for z in reversed(years)], rotation='45')
plt.legend(loc='upper left')
plt.show()
We can also check the fastest-growing sectors in the city:
sectory = list()
slopy = list()
for x, y in job_class_final[:10]:
    plotter = years2[years2['sector'] == x]
    plotter = plotter.groupby(['year_of_open']).size().reset_index(name='Size')
    counts = dict(zip(plotter['year_of_open'].astype(int), plotter['Size']))
    counter = [counts.get(int(ye), 0) for ye in years]
    # fit a line over the recent years to estimate each sector's growth rate
    sloperr, intercept, r_value, p_value, std_err = linregress(list(map(int, years[1:11][::-1])), counter[1:11][::-1])
    slopy.append(sloperr)
    sectory.append(x)
sloper = pd.DataFrame({'sector': sectory, 'slope': slopy})
sloper = sloper.sort_values(by='slope', ascending=False)
plt.figure(figsize=(7,5))
sns.barplot(y=sloper['sector'], x=sloper['slope'], palette='RdBu')
plt.title('fastest growing job sectors as of 2018')
plt.show()
The fastest-growing sector in the city is engineering; these results are in line with the largest sectors.
We can check what level of experience is required by the job bulletins:
experience=df['EXPERIENCE_LENGTH'].value_counts().reset_index()
experience['index']=experience['index'].apply(lambda x : x.lower())
experience=experience.groupby('index',as_index=False).agg('sum')
labels=experience['index']
sizes=experience['EXPERIENCE_LENGTH']
plt.figure(figsize=(5,7))
plt.pie(sizes,explode=(0, 0.1, 0, 0,0,0,0),labels=labels)
plt.gca().axis('equal')
plt.title('Experience value count')
plt.show()
Most of the jobs require two years of experience. We can also check whether education or experience is preferred:
x1=df['SCHOOL_TYPE'].value_counts()[0]
x2=df['FULL_TIME_PART_TIME'].value_counts()[0]
plt.figure(figsize=(5,5))
plt.bar(height=[x1,x2],x=['College Degree','Experience'])
plt.show()
According to the listings, most of the jobs prefer experience over a college degree.
We can check the most common requirements by analyzing the words present in them:
req = ' '.join(r for r in df['requirements'])  # join all requirement sections into one text blob
token = nltk.tokenize.word_tokenize(req)
counter = Counter(token)
count = [x for x in counter.most_common(40) if len(x[0]) > 3]
print("Most common words in Requirement")
print(count)
plotc = count[:10]
a, b = map(list, zip(*plotc))  # split the (word, count) tuples into two lists
sns.barplot(b, a, palette='rocket')  # create barplot
plt.title('most common requirements')
plt.xlabel('count')
plt.ylabel('requirements')
plt.show()
Most common words in Requirement
[('experience', 1211), ('paid', 945), ('full-time', 930), ('years', 850), ('with', 594), ('City', 418), ('Angeles', 416),
('level', 318), ('from', 274), ('year', 263), ('college', 256), ('accredited', 253), ('university', 218), ('class', 199), ('professional', 189), ('maintenance', 172), ('which', 169), ('units', 157)]
Experience is the most in demand, whereas college or university do not even appear in the top 10. We can also see what most of the jobs use as screening processes:
plt.figure(figsize=(7,7))
count=df['selection'].astype(str).value_counts()[:10]
sns.barplot(y=count.index,x=count,palette='rocket')
plt.gca().set_yticklabels(count.index,rotation='45')
plt.show()
We can also see which months the job offers are posted in:
plt.figure(figsize=(7,5))
df['open_month']=[z.month for z in df['opendate']]
count=df['open_month'].value_counts(sort=False)
sns.barplot(y=count.values,x=count.index,palette='rocket')
month_name=['','january','february','march','april','may','june','july','august','september','october','november','december']
plt.gca().set_xticklabels([month_name[x] for x in count.index],rotation='45')
plt.show()
From this graph the jobs seem to be well distributed across the year. We can also see which of the job postings do not have an end date:
print('%d job applications may close without prior notice' %df['deadline'].isna().sum())
44 job applications may close without prior notice
Most jobs only require an interview along with an essay.
We can also check pronoun usage to look for gender bias:
def pronoun(data):
    """Tokenize data and perform POS tagging. Returns a Counter of tokens tagged 'PRP' (personal pronoun)."""
    token = nltk.tokenize.word_tokenize(data)
    pos = nltk.tag.pos_tag(token)
    vrb = Counter([x[0] for x in pos if x[1] == 'PRP'])
    return vrb
req_prn = pronoun(req)
duties= ' '.join(d for d in df['duties'])
duties_prn = pronoun(duties)
print('pronouns used in requirement section are')
print(req_prn.keys())
print('\npronouns used in duties section are')
print(duties_prn.keys())
pronouns used in requirement section are
dict_keys(['or2', 'or3', 'or4', 'orb', 'you', 'I', 'they', 'orc', 'or5', 'or6', 'or7', 'or8', 'You', 'it', 'orB', 'offset', 'ore.'])
pronouns used in duties section are
dict_keys(['they', 'it', 'You', 'I', 'you', 's', 'or2', 'They'])
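Many of the "pronouns" above (or2, orb, ore.) are not pronouns at all; they are artifacts of stripping the line breaks, which runs bulleted alternatives like "or 2." together. A small filter inside pronoun (a sketch) removes at least the non-alphabetic ones:
vrb = Counter([x[0] for x in pos if x[1] == 'PRP' and x[0].isalpha()])  # drops tokens like 'or2' and 'ore.'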
The pronouns used are all gender neutral. We can also check for gender-biased job titles such as 'policeman' or 'fireman':
for name in df['Position']:
    z = re.match(r'\w+?\s?\w+(man|woman|men|women)$', name)
    x = re.match(r'\w+?\s?(man|woman|men|women|male|female)$', name)
    if z is not None:
        print(z)
    if x is not None:
        print(x)
None were found; the job titles appear to be free of gendered language.
The Flesch-Kincaid Grade Level (FKGL) is designed to indicate how difficult a reading passage is to understand. The result is a number that corresponds with a U.S. grade level: FKGL = 0.39 * (total words / total sentences) + 11.8 * (total syllables / total words) - 15.59
Flesch index vs. reading grade:
0-30: College
50-60: High School
90-100: Fourth Grade
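As a quick worked example, take the sentence "The quick brown fox jumps over the lazy dog." (9 words, 1 sentence, roughly 11 syllables): FKGL = 0.39 * (9/1) + 11.8 * (11/9) - 15.59 ≈ 3.51 + 14.42 - 15.59 ≈ 2.3, i.e. about a second-grade reading level.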
reading = []
for file in df['File Name']:
    with open("C:/Users/twist/CityofLA/Job Bulletins/" + file, 'r') as f:
        text = f.read()
    # crude sentence count: terminal punctuation marks
    sentence = text.count('.') + text.count('!') + text.count(';') + text.count(':') + text.count('?')
    sentence = max(sentence, 1)  # guard against division by zero on odd files
    words = len(text.split())
    # crude syllable heuristic: count vowels, subtract silent endings
    syllable = 0
    for word in text.split():
        for vowel in ['a', 'e', 'i', 'o', 'u']:
            syllable += word.count(vowel)
        for ending in ['es', 'ed', 'e']:
            if word.endswith(ending):
                syllable -= 1
        if word.endswith('le'):
            syllable += 1
    G = round((0.39 * words) / sentence + (11.8 * syllable) / words - 15.59)  # FKGL formula
    reading.append(G)
plt.hist(reading)
plt.xlabel('Flesch Index')
plt.title('Flesch index distribution')
plt.show()
The computed grade levels put the bulletins at roughly a college reading level. Let's see how that contrasts with the college requirements:
tot=df['SCHOOL_TYPE'].shape[0]
x1=round(df[df['SCHOOL_TYPE']=='College or University'].shape[0]/tot*100)
x2=100-x1
plt.figure(figsize=(5,5))
plt.bar(height=[x1,x2],x=['College Degree','No College Degree'])
plt.show()
We can check for relevant promotions as follows:
def similar_jobs(job):
    """Find and return jobs with a similar job title.

    Takes a single argument: a job title.
    Returns a list of (similar job title, row index) tuples.
    """
    word1 = nltk.tokenize.word_tokenize(job)
    jobs = []
    for i, name in enumerate(df['Position']):
        word2 = nltk.tokenize.word_tokenize(name)
        # Jaccard distance between the sets of title words
        distance = nltk.jaccard_distance(set(nltk.ngrams(word1, n=1)), set(nltk.ngrams(word2, n=1)))
        if distance < .4:
            jobs.append((name, i))
    return jobs
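For intuition, Jaccard distance is 1 minus the overlap between two token sets, so a quick check (the titles here are illustrative):
print(nltk.jaccard_distance(set(['senior', 'systems', 'analyst']), set(['systems', 'analyst'])))  # 1 - 2/3 ≈ 0.33
Since 0.33 is below the 0.4 threshold, these two titles would be treated as similar.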
promotions = list()
for name in df['Position']:
    z = re.match(r'\w*\s*\w*\s*(supervisor|manager|specialist|senior)\s*\w*\s*\w*$', name)
    if z is not None:
        promotions.append(name)
positions = list()
for y in promotions:
    z = re.sub(r'\s?(supervisor|manager|specialist|senior)\s?$', "", str(y))  # strip the rank word to get the base title
    positions.append(z)
lowers = list()
print(positions)
for name in df['Position']:
    for check in positions:
        if name.find(check) != -1 and name not in promotions and name not in lowers:
            lowers.append(name)  # base-level jobs sharing a title with a promotion
print(lowers)
listprom = pd.DataFrame(columns=['start', 'promotion', 'salary'])
for pos in df['Position']:
    promotioner = list()
    salprom = list()
    sim = similar_jobs(pos)
    sal = df.loc[df['Position'] == pos, 'salary_start'].iloc[0]
    for name, index in sim:
        # a similar title with a higher starting salary counts as a promotion
        if int(df.loc[index, 'salary_start']) > int(sal):
            promotioner.append(name)
            salprom.append(df.loc[index, 'salary_start'])
    df2 = pd.DataFrame({'promotion': promotioner})
    df2['salary'] = salprom
    df2['start'] = pos
    listprom = listprom.append(df2)
print(listprom.shape)
listprom = listprom.reset_index(drop=True)
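A hypothetical lookup of the promotions reachable from one title (the title here is illustrative):
print(listprom[listprom['start'] == 'accountant'])  # shows the higher-paid similar jobs for that position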
Jobs with similar requirements can be found in the same way:
def similar_req(job):
    """Find and return jobs with similar requirements.

    Takes a single argument: the requirements text of a job.
    Returns a list of (job title, row index) tuples.
    """
    word1 = nltk.tokenize.word_tokenize(job)
    jobs = []
    for i, name in enumerate(df['requirements']):
        word2 = nltk.tokenize.word_tokenize(name)
        distance = nltk.jaccard_distance(set(nltk.ngrams(word1, n=1)), set(nltk.ngrams(word2, n=1)))
        if distance < .5:
            jobs.append((df.iloc[i]['Position'], i))
    return jobs
print(similar_req(df['requirements'][10]))
[('chief of drafting operations', 117), ('chief of operations', 118)]
This way, applicants applying for one job can learn about other jobs with similar requirements, making those jobs easier to access.
To summarize the findings:
1. After estimating the sectors of jobs, it was found that the job market is mostly dominated by the engineering and services sectors.
2. The salary distribution is in line with the American middle class.
3. The highest-paying jobs are in the services sector and shipping and port management.
4. The greatest pay rise without a promotion can generally be achieved in supervisor positions.
5. There is no shortage of job opportunities in LA, with job listings rising sharply since 2016.
6. The rise in job listings is largely driven by the largest job sectors.
7. The fastest-growing job sectors are the most dominant ones.
8. Experience of at least two years is demanded by the majority of job listings.
9. Experience is valued more highly than a college degree, with almost all jobs requiring some sort of experience whereas only about a third require a college degree of some sort.
10. Job listings are distributed fairly evenly across the year.
11. Checking the requirements of the jobs shows previous experience as the most dominant requirement.
12. Essays, interviews and tests (multiple choice etc.) are the most common selection processes, with most jobs requiring an essay along with an interview.
13. The pronouns used in the job bulletins are all gender neutral.
14. No gender-biased job titles such as "fireman" or "policeman" are used.
Based on these findings, some recommendations:
1. More no-collar jobs need to be introduced, as the job market seems lacking in that area.
2. The most successful economies are ones where the job market is diverse. The LA job market seems to be largely dominated by one or two sectors, so more diversity in the job market should be encouraged.
3. Almost all jobs require previous experience, which discourages fresh graduates and keeps them out of the job market.
4. Most jobs have 2 to 3 screening procedures, which limits potential candidates; these could be reduced to just a face-to-face interview.
5. The bulletins have the reading difficulty of a college graduate, whereas two thirds of them do not require a college degree. The bulletins should be written in simpler language.
6. The City of LA can use the promotions function to find the promotions available to each job listing.
7. The City of LA can use the similar-requirements function to find similar jobs, so that people applying for one job can find others they might be eligible for as well.
8. All job bulletins should have a closing date specified, to avoid confusion.
Reference: shahules' kernel on Kaggle.