Tuesday 15 December 2020

Scraping University Courses (part 1)

As a prospective student looking to start a degree at a university, it is important to know which universities offer the courses you are looking for. In this project I plan to write some code that scrapes university web pages to get a list of all the courses they offer. As my undergraduate degree is from the UK, I will focus only on UK universities for now.

Need help getting started with web scraping? See my article here - Scraping daraz.lk

For this project I have chosen the following universities:

  1. University of Manchester
  2. University of Cambridge
  3. University of Hertfordshire
  4. University of Chester
  5. Bangor University
  6. Ulster University
  7. Northumbria University
  8. University of Roehampton
  9. Solent University
  10. University of Essex

Location!

The first step in scraping was to identify the page on each website where the undergraduate or postgraduate courses are listed. Taking the University of Manchester as an example, its undergraduate courses for 2021 are listed on the page here: (https://www.manchester.ac.uk/study/undergraduate/courses/2021/).




Next I identify the specific HTML tag that holds the course title. This is usually the same for every course title, since universities tend to populate their course lists with an automated process. You can find it using the "inspect" function in your web browser. This opens a side panel showing the underlying HTML, where you can look up the name of the tag and the class that hold the course name. In this case it is a <div> tag with a class called "title".
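To make the tag-and-class idea concrete, here is a minimal sketch that pulls the text out of every <div class="title"> element. It is only an illustration: it uses Python's built-in html.parser instead of Beautiful Soup so that it runs with no extra packages, and the sample markup, the class name TitleDivParser and the function extract_titles are made up for this example rather than taken from the Manchester page or from my repository.

```python
from html.parser import HTMLParser


class TitleDivParser(HTMLParser):
    """Collect the text inside every <div class="title"> element."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._depth = 0    # > 0 while inside a title div (counts nested divs)
        self._chunks = []  # text fragments of the div currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # a nested div inside a title div
            return
        classes = (dict(attrs).get("class") or "").split()
        if "title" in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                # Join the fragments and collapse runs of whitespace.
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.titles.append(text)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_titles(html):
    """Return the course titles found in an HTML string."""
    parser = TitleDivParser()
    parser.feed(html)
    parser.close()
    return parser.titles


# Made-up markup that imitates a course listing (an assumption, not real data).
SAMPLE_HTML = """
<ul>
  <li><div class="title"><a href="/courses/a/">Example Course A</a></div></li>
  <li><div class="title"><a href="/courses/b/">Example Course B</a></div></li>
  <li><div class="summary">Not a course title</div></li>
</ul>
"""

for title in extract_titles(SAMPLE_HTML):
    print(title)
```

With Beautiful Soup the same lookup is much shorter, but the principle is identical: match on the tag name and the class, then read the text.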

Base Code

To test this out, I wrote a simple Python program using the Selenium and Beautiful Soup packages and the Chrome web driver. Next I moved the remaining web scraping code into a function so that I can use it over and over again. To verify the output, I added a loop that prints the courses to the terminal one by one.

See code on my GitHub.
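If you do not have the repository open, the page-loading step might look roughly like the sketch below. This is not the code from my GitHub. The function name fetch_page_source and its make_driver parameter are hypothetical, and it assumes Selenium and a Chrome driver are installed when no custom driver is passed in.

```python
def fetch_page_source(url, make_driver=None):
    """Load `url` in a browser and return the rendered HTML as a string.

    `make_driver` is a hypothetical hook: any callable that returns an
    object with get(), page_source and quit(). It defaults to Chrome.
    """
    if make_driver is None:
        # Imported here so the function can be tried out without Selenium
        # by passing in a stand-in driver.
        from selenium import webdriver
        make_driver = webdriver.Chrome
    driver = make_driver()
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails
```

Calling fetch_page_source with the Manchester URL above should return the page HTML, which can then be handed to whatever parsing function you use. Wrapping the call in try/finally means the browser window is closed even when the page fails to load.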

If you run this code with the required Python packages installed, you will see every course in the University of Manchester undergraduate catalog printed to the terminal one by one. With a few modifications, the same loop can also pick up the link to each course page and add it to a list.

See code on my GitHub.
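As a rough illustration of collecting the link along with the title, the sketch below stores (title, URL) pairs in a list. It assumes that each title <div> wraps an <a> tag whose href points to the course page, which may not match every site. The names CourseLinkParser and extract_courses, the sample markup and the university.example address are all invented for this example, and the built-in html.parser again stands in for Beautiful Soup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CourseLinkParser(HTMLParser):
    """Collect (title, absolute URL) pairs from links inside title divs."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.courses = []
        self._div_depth = 0  # > 0 while inside a <div class="title">
        self._href = None    # href of the link currently being read
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            if self._div_depth:
                self._div_depth += 1
            elif "title" in (attrs.get("class") or "").split():
                self._div_depth = 1
        elif tag == "a" and self._div_depth and attrs.get("href"):
            self._href = attrs["href"]
            self._chunks = []

    def handle_data(self, data):
        if self._href is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            title = " ".join("".join(self._chunks).split())
            # Relative links are resolved against the listing page's URL.
            self.courses.append((title, urljoin(self.base_url, self._href)))
            self._href = None
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1


def extract_courses(html, base_url):
    """Return a list of (title, url) tuples found in an HTML string."""
    parser = CourseLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.courses


# Made-up listing page and address (assumptions for illustration only).
BASE_URL = "https://university.example/courses/2021/"
SAMPLE_HTML = """
<div class="title"><a href="00001/example-course-a/">Example Course A</a></div>
<div class="title"><a href="/courses/b/">Example Course B</a></div>
<div class="footer"><a href="/contact/">Contact</a></div>
"""

for title, url in extract_courses(SAMPLE_HTML, BASE_URL):
    print(title, url)
```

Resolving each href with urljoin matters because listing pages often use relative links, which are of no use once they are stored away from the page they came from.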

Although this code successfully obtains all the information needed for the University of Manchester, it is clear that, as the project progresses, new functionality will be needed to handle the different web page structures on other sites.

For example, the second university, the University of Cambridge, lists its courses in a simple table with no class identifiers. I have modified the code to handle both styles of data.

This edit can be seen in the code on my GitHub.
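One possible way to support both layouts, and not necessarily the way my repository does it, is to keep a small rule per site that says which tag to read and whether a class is required. In the sketch below the SITE_RULES table, the rule for Cambridge (every <td> cell, no class), the ContainerTextParser class and the sample markup are all assumptions made for illustration. A real table with several columns would need an extra step to pick the right cell.

```python
from html.parser import HTMLParser


class ContainerTextParser(HTMLParser):
    """Collect text from every <tag> element, optionally filtered by class."""

    def __init__(self, tag, css_class=None):
        super().__init__()
        self.tag = tag
        self.css_class = css_class  # None means "accept any class"
        self.items = []
        self._depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self._depth:
            self._depth += 1  # same tag nested inside a matched element
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.css_class is None or self.css_class in classes:
            self._depth = 1
            self._chunks = []

    def handle_endtag(self, tag):
        if tag == self.tag and self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._chunks).split())
                if text:
                    self.items.append(text)

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


# Hypothetical per-site rules; the Cambridge entry is a guess at a
# class-free table layout, not a description of the live page.
SITE_RULES = {
    "manchester": {"tag": "div", "css_class": "title"},
    "cambridge": {"tag": "td", "css_class": None},
}


def extract_titles(html, site):
    """Return the course titles for `site` using its rule in SITE_RULES."""
    parser = ContainerTextParser(**SITE_RULES[site])
    parser.feed(html)
    parser.close()
    return parser.items


# Made-up markup for the two styles (assumptions for illustration only).
DIV_HTML = '<div class="title"><a href="/a/">Example Course A</a></div>'
TABLE_HTML = """
<table>
  <tr><th>Course</th></tr>
  <tr><td><a href="/c/">Example Course C</a></td></tr>
  <tr><td><a href="/d/">Example Course D</a></td></tr>
</table>
"""

print(extract_titles(DIV_HTML, "manchester"))
print(extract_titles(TABLE_HTML, "cambridge"))
```

Keeping the differences in a lookup table means that adding the next university is mostly a matter of adding one more entry, at least while its page fits the tag-plus-class pattern.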


Go to Part 2
