

Optimizing my search for Data scientist jobs by scraping Indeed with R


A few weeks ago, I started looking for a data scientist position in industry. My first moves were to look at the job posts on websites such as Indeed and to update my resume. After reading numerous job posts and working several hours on my resume, I wondered if I could ...
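Everything that follows relies on rvest to download and parse Indeed's results pages. As a minimal setup sketch (the object name `page` and the exact search URL are assumptions consistent with the code below, not the post's verbatim code):

```r
# Minimal setup sketch (assumed, not the post's verbatim code)
library(rvest)       # read_html(), html_elements(), html_text2(), ...
library(tidyverse)   # dplyr, ggplot2, stringr, forcats, ...

# First results page of the Indeed search used throughout the post
first_page_url <- "https://fr.indeed.com/jobs?q=data%20scientist&l=France"
page <- read_html(first_page_url)   # 'page' is the parsed results page reused in the snippets below
```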
" />
A few weeks ago, I started looking for a data scientist position in industry. My first moves were: To look at the job posts on websites such as Indeed To update my resume After reading numerous job posts and work several hours on my resume, I wondered if I could ...
">
```r
# Clean the data to only get numbers
... %>%                     # ... = job-post count read from the results page (start of pipe not shown here)
  substr(start = 2, stop = 8) %>%
  as.numeric()
## [1] 1761
```

For now, we can only scrape the data from the first page. However, I am interested in all the job posts, so I need to access the other pages! After navigating through the first three pages of listed jobs, I noticed a pattern in the URL address (valid at the time of writing), which means that with a single line of code I can produce a list containing the URLs of the first 40 pages. Once I have the list, the only thing left is to loop over all the URLs with some delay (good practice for web scraping), collect the data and clean it with custom functions (given at the end of the post):

```r
# Creating URL links corresponding to the first 40 pages
base_url = "https://fr.indeed.com/jobs?q=data%20scientist&l=France&start="
url_list <- ...   # ... = the 40 page URLs built from base_url (not shown here)

# The fields below are extracted inside the loop over url_list;
# 'page' denotes the parsed results page of the current iteration

# Job title (start of this pipe not shown here)
... %>%
  html_element("h2") %>%
  html_text2() %>%
  str_replace(".css.*;\\}", "")

# URL for job post
job_url <- page %>%
  html_elements(css = ".mosaic-provider-jobcards .result") %>%
  html_elements(css = ".resultContent") %>%
  html_element("h2") %>%
  html_element("a") %>%
  html_attr('href') %>%
  lapply(function(x){paste0("https://fr.indeed.com", x)}) %>%
  unlist()

# Data about company
company_info <- page %>%
  html_elements(css = ".mosaic-provider-jobcards .result") %>%
  html_elements(css = ".resultContent") %>%
  html_element(css = ".company_location") %>%
  html_text2() %>%
  lapply(FUN = tidy_comploc) %>%   # Function to clean the textual data
  do.call(rbind, .)

# Data about job description
job_desc <- page %>%
  html_elements(css = ".mosaic-provider-jobcards .result") %>%
  html_element(css = ".slider_container .jobCardShelfContainer") %>%
  html_text2() %>%
  tidy_job_desc()   # Function to clean the textual data related to job desc.

# Data about salary (when indicated)
salary_hour <- page %>%
  html_elements(css = ".mosaic-provider-jobcards .result .resultContent") %>%
  html_element(css = ".salaryOnly") %>%
  html_text2() %>%
  lapply(FUN = tidy_salary) %>%   # Function to clean the data related to salary
  do.call(rbind, .)

# Job posts in the same format
final_df <- ...   # ... = assembly of the scraped fields into one data frame (not shown here)
```

Salary by company

```r
... %>%   # ... = aggregation of salaries by company (not shown here)
  ggplot(aes(x = Company)) +
  geom_point(aes(y = Mean_salary), colour = "#267266") +
  geom_linerange(aes(ymin = Low_salary, ymax = High_salary)) +
  geom_hline(aes(yintercept = median(Mean_salary)), lty = 2, col = 'red', alpha = 0.7) +
  scale_y_continuous(labels = euro) +
  ylab("Monthly income") +
  xlab("") +
  coord_flip() +
  theme_bw(base_size = 8)
```

The median monthly salary is around 3700 euros. As you can see, salaries vary a lot from one company to another. This is partly because I did not distinguish between the different data science jobs (data scientist, data analyst, data engineer, senior or lead).
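The aggregation that feeds the salary-by-company plot, as well as the definition of the euro axis formatter, are not shown above. A possible sketch, modeled on the by-job-title code in the next section (the name salary_by_company and the euro definition are assumptions, not the post's original code):

```r
library(dplyr)
library(forcats)
library(scales)

# Possible euro formatter for the y axes (assumption; the post defines 'euro' elsewhere)
euro <- label_dollar(prefix = "", suffix = " €", big.mark = " ")

# Possible head of the salary-by-company pipeline (assumption, mirroring the
# by-job-title version shown in the next section)
salary_by_company <- final_df %>%
  filter(Low_salary > 1600) %>%                     # drop internships and freelance gigs
  group_by(Company) %>%
  summarize(across(c(Low_salary, High_salary), ~ mean(.x, na.rm = TRUE))) %>%
  mutate(Mean_salary = rowMeans(cbind(Low_salary, High_salary), na.rm = TRUE),
         Company     = fct_reorder(Company, Mean_salary))
```

Piping salary_by_company into the ggplot() call above then reproduces the company-level plot.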
Salary by job title

We can plot the same graph, but grouping by job title instead of by company:

```r
final_df %>%
  filter(Low_salary > 1600) %>%   # To remove internships and freelance work
  select(Job_title_c, Low_salary, High_salary, Job_type) %>%
  group_by(Job_title_c) %>%
  summarize_if(is.numeric, ~ mean(.x, na.rm = TRUE)) %>%
  mutate(Mean_salary = rowMeans(cbind(Low_salary, High_salary), na.rm = T),
         Job_title_c = fct_reorder(Job_title_c, desc(-Mean_salary))) %>%
  ggplot(aes(x = Job_title_c, y = Mean_salary)) +
  geom_point(aes(y = Mean_salary), colour = "#267266") +
  geom_linerange(aes(ymin = Low_salary, ymax = High_salary)) +
  # geom_label(aes(label = n, Job_title_c, y = 1500), data = count_df) +
  scale_y_continuous(labels = euro) +
  theme_bw(base_size = 12) +
  xlab("") +
  ylab("Monthly Income") +
  coord_flip()
```

We clearly see differences in the proposed salaries depending on the job title: data scientists seem to earn slightly more on average than data analysts. Companies also seem to offer higher salaries for jobs with more responsibilities or requiring more experience (senior, lead).

Salary depending on location: full remote, hybrid or on site?

Finally, we can plot the salaries depending on the work arrangement (full remote, hybrid, on site) to see whether it has an impact:

```r
# Tidy the types and locations of listed jobs
# (that tidying step is not shown here; count_df used below holds the number of
#  job posts per Job_type, presumably count(filter(final_df, Low_salary > 1600), Job_type))

final_df %>%
  filter(Low_salary > 1600) %>%
  drop_na(Location) %>%
  mutate(Mean_salary = rowMeans(cbind(Low_salary, High_salary), na.rm = T),
         Job_type = as.factor(Job_type)) %>%
  ggplot(aes(x = Job_type, y = Mean_salary)) +
  geom_boxplot(na.rm = TRUE) +
  geom_label(aes(label = n, Job_type, y = 5500), data = count_df) +
  scale_y_continuous(labels = euro) +
  theme_bw(base_size = 12) +
  xlab("Job Type") +
  ylab("Income")
```

It is worth noting that most of the jobs offered in France are on-site jobs. The median salary for these jobs is slightly lower than for hybrid jobs. The salary distributions of full-remote and hybrid jobs must be taken with care, as they are represented by only 12 job posts.

Mapping job locations

During my job search, I was frustrated not to find a geographical map gathering the locations of all the proposed jobs. Such a map could help me greatly in my search, so let's build one! First, we must tidy and homogenize the locations of all the job posts. To this end, I wrote a custom function, tidy_location(), built on several stringr functions; you can find more details about it at the end of this post. It outputs the location in the format [Town] ([Zip code]).
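tidy_location() itself is only listed at the end of the original post; as a rough idea of what such a stringr-based cleaner can look like (the input formats and regular expressions below are assumptions, not the post's actual rules):

```r
library(stringr)

# Rough sketch of a location cleaner (assumed logic, not the post's tidy_location()):
# turn raw Indeed locations such as "Télétravail à Paris (75)" or
# "92130 Issy-les-Moulineaux" into a homogeneous "Town (Zip)" string
tidy_location_sketch <- function(loc) {
  zip  <- str_extract(loc, "\\d{2,5}")                            # zip / department code
  town <- loc %>%
    str_remove("\\(?\\d{2,5}\\)?") %>%                            # drop the numeric part
    str_remove("(?i)télétravail( partiel)?( hybride)?( à)?") %>%  # drop remote-work wording
    str_squish()
  str_c(town, " (", zip, ")")
}

tidy_location_sketch("Télétravail à Paris (75)")
#> [1] "Paris (75)"
```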
Even though all the locations have been homogenized, they cannot be plotted on a map yet: we need the longitude and latitude. To get them from the town name and zip code, I used the geocode() function from the tidygeocoder package.

```r
# Extract coordinates from town name
final_df <- final_df %>%
  mutate(Loc_tidy_fr = paste(Loc_tidy, 'France')) %>%
  geocode(Loc_tidy_fr, method = 'arcgis', lat = latitude, long = longitude) %>%
  select(-Loc_tidy_fr)
```

Distribution of Data Science jobs in France

We can now represent the number of Data Science jobs by department:

```r
# Map of France from the rnaturalearth package
france <- ... %>%   # ... = map downloaded with rnaturalearth (not shown here)
  filter(!name %in% c("Guyane française", "Martinique", "Guadeloupe",
                      "La Réunion", "Mayotte"))

# Transform location to st point
# (the code building 'test', i.e. the job coordinates as sf points joined to the
#  France map to get a 'region' column, and the definition of my_breaks are not shown here)
test %>%
  group_by(region) %>%
  summarize(Job_number = n()) %>%
  mutate(Job_number = cut(Job_number, my_breaks)) %>%
  ggplot() +
  geom_sf(aes(fill = Job_number), col = 'grey', lwd = 0.2) +
  scale_fill_brewer("Job number", palette = "GnBu") +
  theme_bw()
```

It is really interesting to see how heterogeneous the distribution of jobs is across France. The majority of the jobs are concentrated in a few departments, each containing a large city. This is expected, as most of the jobs are offered by large companies, which are often located close to major cities.

Interactive map

We can go further and plot an interactive map with leaflet, which lets us search dynamically for a job post:

```r
# Plot leaflet map
final_df %>%
  mutate(pop_up_text = sprintf("%s %s", Job_title, Company)) %>%   # Make popup text
  leaflet() %>%
  setView(lng = 2.36, lat = 46.31, zoom = 5.2) %>%   # Center of France
  addProviderTiles(providers$CartoDB.Positron) %>%
  addMarkers(
    popup = ~as.character(pop_up_text),
    clusterOptions = markerClusterOptions()
  )
```

Analyzing job descriptions

Nowadays, most resumes are scanned and interpreted by an applicant tracking system (ATS). Put simply, this system looks for keywords in your resume and assesses how well they match the job you are applying for. It is therefore important to describe your experience with specific keywords to improve your chances of getting to the next step of the hiring process. But which keywords should I include in my resume? Let's answer this question by analyzing the descriptions of data scientist job posts.

Downloading and cleaning each job description

First, we download the full description of each job by navigating through all the URLs listed in our table. We then clean and homogenize the descriptions with a custom function:

```r
# Loop through all the URLs
job_descriptions <- ...
```
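The loop is cut off above; a sketch of what it describes could look like this (the Job_url column name, the #jobDescriptionText selector, the one-second delay and the clean_description() helper are all assumptions standing in for the post's actual code):

```r
library(rvest)
library(purrr)
library(stringr)

# Placeholder for the post's custom cleaning function (assumed behaviour)
clean_description <- function(x) str_squish(tolower(x))

# Sketch of the download loop described above (details are assumptions, see lead-in)
job_descriptions <- map_chr(final_df$Job_url, function(url) {   # Job_url: assumed column name
  Sys.sleep(1)                                      # polite delay between requests
  url %>%
    read_html() %>%
    html_element(css = "#jobDescriptionText") %>%   # description block on an Indeed job page (assumed selector)
    html_text2() %>%
    clean_description()
})
```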
