Web Scraping Project: Create a Job Board with Beautiful Soup & Python

José Manuel García Portillo
Published in Analytics Vidhya
10 min read · Nov 8, 2020

Image by RitaE from Pixabay

Introduction

The incredible amount of data on the Internet is a rich resource for any field of research or personal interest. Harvesting that data effectively through an automated process is known as web scraping.

Say we want to get some content from one or more webpages. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.

Data

For this project, I won’t be using any .csv or .xlsx data files as I have been doing in previous Data Science projects. This time I will be scraping the data I am interested in and creating my own personal dataset.

From that point onwards, it is just a matter of creating a DataFrame with Pandas and working with the data we have to extract some useful insights.

Methodology

First of all, let’s start by stating the objective of the project: to create a Job Board with information on job offers that appear on the Internet.

For the sake of argument, let's say I am interested in "Data Analysis" job offers in Japan and I want to get the data from two different well-known job sites.

What kind of data do I want to obtain? Let's keep it simple: Title, URL, Update date, Location, Salary, Type of job, Experience, and Skills requirements.

Disclaimer: all of the website names and URLs in this article are fictional.

Import Libraries

There are two main libraries needed when web scraping with Beautiful Soup:

  1. Requests
  2. Bs4 (Beautiful Soup)

Other useful libraries are:

3. urllib.parse (to join relative URLs with the base URL in order to create a full URL)

4. time (requesting a lot of information in a short period of time is not good practice, as it can hurt the website; we use this library to ensure there is a pause between requests)

5. Pandas & NumPy (to play around with the data we scrape from the websites)

Useful libraries when Web Scraping
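Since the full kernel lives on GitHub, here is a minimal sketch of what these imports might look like:

```python
# Core libraries for web scraping with Beautiful Soup
import requests                   # download the HTML of each page
from bs4 import BeautifulSoup     # parse the downloaded HTML

# Supporting libraries
from urllib.parse import urljoin  # join relative URLs with the base URL
import time                       # pause between requests
import pandas as pd               # build the final Job Board DataFrame
import numpy as np                # NaN values for missing data
```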

Outline of the project

First, we create a dictionary containing the URLs of the two websites we want to scrape after searching for Data Analysis job offers in Japan:

Dictionary with the websites we want to scrape data from
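Since the real sites are not named in this article, a sketch of the dictionary with made-up names and URLs could look like this:

```python
# Search URLs for "Data Analysis" jobs in Japan.
# Both the site names and the URLs are fictional placeholders.
websites = {
    "jobsite_one": "https://www.jobsite-one.example/jobs?q=data+analysis&location=japan",
    "jobsite_two": "https://www.jobsite-two.example/search?keyword=data+analysis&country=jp",
}
```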

The overall concept is to loop over the values of the dictionary. Using each value, I will make my request; in the request it is even possible to specify the page number we want to take data from.

If everything goes well (the response status code is 200), we get the HTML out of the content of that request. From then on, we can use Beautiful Soup to parse that HTML so that we can extract information from it.

Outline of the For Loop for the project
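As a rough sketch (the page parameter name and the pause length are assumptions, not the sites' real query format), the outline of the loop could look like this:

```python
for site, base_url in websites.items():
    for page in range(1, 50):                      # arbitrary upper bound on pages
        response = requests.get(base_url, params={"page": page})
        if response.status_code != 200:            # no more result pages
            break                                  # move on to the next website
        soup = BeautifulSoup(response.content, "html.parser")
        # ... extract the data we want from this results page ...
        time.sleep(2)                              # be polite: pause between requests
```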

Thus, if by searching "Data Analysis" we get 70 results divided into 10 results per page (a total of 7 pages), we will read each page and move to the next one until we finally arrive at the 7th page.

But the for loop won't stop there and will continue on to the 8th page. That's when the last part of the loop comes in handy. Since there won't be any more results, the request will come back with an error status code (most likely in the 4xx range). It is then that the for loop breaks and moves on to the next value of the dictionary.

Storing the data

After extracting the data we are interested in, let’s say the “Title” for example, we need somewhere to store it: empty lists.

I will store each type of data in a different empty list and, once all the extracting is done, use those lists to create a DataFrame where the final Job Board will be visualised.

Empty lists to store the extracted data
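A sketch of those empty lists, one per field of the final Job Board:

```python
# One empty list per piece of information we want in the Job Board
titles, urls, update_dates = [], [], []
locations, salaries, job_types = [], [], []
experiences, requirements = [], []
```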

Extracting the data

Following the example earlier, let’s imagine our search results are 7 pages of 10 job offers per page (which would make a total of 70 job offers).

In order to extract the relevant data, the steps are:

  1. Get the title of each job offer on the page.
  2. Go into each of those 10 job offers and get more specific data, like the type of skills required.
  3. Move on to the next page with the help of the for loop (a process also called pagination), as explained before.

Now let’s go into the first step!

Keep in mind that I will be using the variable “soup” that was created at the beginning of the project and that contains the parsed HTML of the web page. The information that we want to obtain is in there.

Also, since most of the kernel is written as list comprehensions, it might be difficult to understand, so I will review it step by step.

Title

  1. To get the title we use the "find_all" command, which lets us select all the "a" tags (hyperlinks) that carry the selected class. The result comes in a list format.
  2. Since we get a list as the output, we do a for loop with "t" looping over each of the results we got from the first step.
  3. For each value of "t", we want its "text", which is the "Title" string we are looking for. We also strip any leading or trailing spaces and replace the "\n" character if it appears, to leave the "Title" string perfectly clean.
  4. Notice that everything is inside brackets, so we need another loop (that's where the lambda function comes in) to take each result from the third step and append it to the empty list we created before.
Extracting the title of each job post
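A hedged sketch of that step (the class name "job-title" is a placeholder for the site-specific one, and the original kernel packs the loop into a lambda inside a list comprehension; a plain loop gives the same result):

```python
# 1. Select every "a" tag with the job-title class (placeholder class name)
title_tags = soup.find_all("a", class_="job-title")

# 2-4. Clean each title (text, strip spaces, drop newlines) and append it to "titles"
for t in title_tags:
    titles.append(t.text.strip().replace("\n", ""))
```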

URL

  1. The same step we did when finding the "Title", but with different parameters. The result comes in a list format since it is the output of a "find_all" command.
  2. Since we get a list as the output, we do a for loop with "l" looping over each of the results we got from the first step.
  3. From each value of "l", we want to get the "href" attribute, which contains the relative URL, and put it in brackets.
  4. We do a final loop with "r" over the results of the third step and use the "urljoin" function introduced when importing the libraries. The idea is to join each relative URL with the base URL (the value in the dictionary) to create a full URL.
  5. That's the URL we want: if we click on it, we will be redirected to the specific web page of the job offer.
  6. Finally, just like with the titles, we append it to the empty list we created beforehand.
Extracting the full URL of each job post
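A sketch of the same idea for the URLs (again with a placeholder class name):

```python
# 1-2. The same "a" tags hold the links to each job offer
link_tags = soup.find_all("a", class_="job-title")

# 3-5. Join each relative "href" with the base URL to get the full URL
page_urls = [urljoin(base_url, l["href"]) for l in link_tags]

# 6. Append them to the list created beforehand
urls.extend(page_urls)
```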

Scraping the rest of the information: the second For Loop

Now into the second step!

We got ourselves a list with the first 10 full URLs from the steps we took before. And now we want to go into each one of those full URLs and keep scraping the data we want. How do we do that? The answer is: another For Loop.

For each full URL, we will make a request, then get the HTML out of the content of that request. And, just like we did before, from then on we can use Beautiful Soup to parse that HTML.

For loop to go into each job offer post
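A minimal sketch of that second loop, reusing the full URLs collected from the current results page:

```python
# Visit each job offer page and parse its HTML
for i in page_urls:
    r = requests.get(i)
    r_soup = BeautifulSoup(r.content, "html.parser")
    # ... scrape Update date, Location, Salary, Experience, etc. from r_soup ...
    time.sleep(2)   # pause between requests
```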

Update date / Location / Salary / Experience

Since these four types of information are scraped in a very similar way, I grouped them together for the explanation:

  1. Remember we are now in the second for loop, so we will not be working with the "soup" variable but with the "r_soup" we just created.
  2. Since we are now scraping each job offer individually, we are looking for a single result for "date of update", a single result for "location", and so on. That's why, even though we used "find_all" in this example, we could have used "find" as well. "find" would only give us the first result, but that would be enough, and the output wouldn't be wrapped in a list, which would make it easier to get the text we want.
  3. Anyhow, we got the results in a list after using "find_all". There is only one result, so instead of looping we first take the element inside the list ([0]), since the "text" attribute works on a single tag rather than on the list, and then get the text and strip any leading or trailing spaces.
  4. Finally, we append the result obtained from the third step to the empty list created beforehand, as usual.
Extracting several data from the job offer post
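A sketch of those four fields (all the class names here are placeholders for the site-specific ones):

```python
# find_all returns a list, so we take the single element with [0] before reading .text
update_dates.append(r_soup.find_all("span", class_="update-date")[0].text.strip())
locations.append(r_soup.find_all("span", class_="location")[0].text.strip())
salaries.append(r_soup.find_all("span", class_="salary")[0].text.strip())
experiences.append(r_soup.find_all("span", class_="experience")[0].text.strip())
```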

Type of job

This one is tricky since there are times when there is no information available and we get an IndexError when trying to scrape it.

So that our kernel keeps functioning properly even when an error pops up, we use exception handling.

That is, we use a "try" statement where we tell the program to scrape what we want. There are two possible outcomes: either an IndexError occurs, in which case the "except" clause appends a NaN value to the empty list; or no IndexError comes up and we just append the data we scraped to the empty list.

Extracting the type of job for the job offer post
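A sketch of that exception handling (the "job-type" class is again a placeholder):

```python
try:
    # Scrape the type of job if the field exists on the page
    job_types.append(r_soup.find_all("span", class_="job-type")[0].text.strip())
except IndexError:
    # No information available: store a NaN value instead
    job_types.append(np.nan)
```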

Job requirements

Now on to the last piece of data to scrape! This one is the most challenging to extract since what we want is not a specific class but a paragraph.

To top it off, there are several paragraphs in the job offer post and only one of them corresponds to the job requirements.

  1. Usually, the format of a job post is a big table, and inside it there are headers for each type of information, each followed by its own content. In our case we are looking for "Job Requirements", so we start by looking for that header.
  2. We first get the whole table where all the contents of the job offer are included. If the "Job Requirements" header is in it, then the data we are looking for should be there as well. If the header is not there, the related information does not exist, so we just append a NaN value to the empty list we created.
  3. Let's say the "Job Requirements" header is in the table. The next step is to find all the rows ("tr" tags) of the table and check which of them contains "Job Requirements".
  4. Once we find which row has the "Job Requirements" header, we go to the cell ("td" tag) that contains the related information. The only thing left is to get the text as usual, strip the spaces and get rid of any characters that might hinder the readability of the contents.
  5. Once we have the final data, it is time to close the for loops. We have two of them to take care of: "i" (which goes over each full URL to scrape the contents we want) and "page" (when one results page is finished, we move on to the following one, and so on).
Extracting the data and closing the loop
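A sketch of that last step, assuming a simple header/content table layout (the class names and cell positions are assumptions):

```python
table = r_soup.find("table", class_="job-details")   # the big table with all the job details

if table is not None and "Job Requirements" in table.text:
    # Look for the row whose header is "Job Requirements"
    for row in table.find_all("tr"):
        if "Job Requirements" in row.text:
            content_cell = row.find_all("td")[-1]     # the cell with the actual requirements
            requirements.append(content_cell.text.strip().replace("\n", " "))
            break
else:
    # The header is not in the table, so the information does not exist
    requirements.append(np.nan)
```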

Results

Up until now, I have explained step by step how the scraping is done with the results of a certain search on a website, how the pagination works, and how to get the data we want from each job offer individually. Not only that, the workings of the big first for loop, which goes over the two websites we want to scrape data from, were also described.

Since the modus operandi is more or less the same, I won't delve into the scraping of the second website's results. If you are interested, please check my GitHub page at the end of this article for the full kernel of the project.

Now it is time to give shape to the data we scraped, and that's achieved by putting it all together in a DataFrame. With this, we will create the final Job Board.

As a plus, what if we really like the conditions of a job offer and want to find out more about it, or maybe even apply? It would be tiring to copy and paste the URL, right? We can make it clickable with the "style.format" function!

Creating a DataFrame with the data we scraped
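A sketch of that final step (the exact column names and the link formatter are assumptions; the clickable links show up when the styled DataFrame is displayed in a notebook):

```python
# Assemble the Job Board from the lists filled in while scraping
job_board = pd.DataFrame({
    "Title": titles,
    "URL": urls,
    "Update date": update_dates,
    "Location": locations,
    "Salary": salaries,
    "Type of job": job_types,
    "Experience": experiences,
    "Job Requirements": requirements,
})

# Make the URL column clickable when the styled DataFrame is displayed
job_board.style.format({"URL": lambda u: f'<a href="{u}">{u}</a>'})
```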

Conclusion

The legality of web scraping is a "grey" area, so it is up to us how we use the tools at hand.

A clear example of illegal web scraping is scraping private user data. Private data is usually not accessible to everyone; a typical example is data obtained from personal accounts on social media.

I am going to wear the “white hat” here and emphasise that you should always check the legality of your actions first.

Being aware of the legal issues is of paramount importance before becoming involved with, or setting up, such businesses. Only then can you avoid falling into a sanctionable activity or becoming a victim of one.

For more information on the Kernel of this project please visit: https://github.com/josem-gp

Also, I may keep posting about this field, this time doing projects with APIs, Scrapy and Selenium. Stay tuned!


I'm a Full Stack Web Developer with a background in the teaching industry. Passionate about solving real-life problems and improving the community's quality of life.