Sunday, 1 November 2020

Scraping Daraz.lk

I have always wanted to learn how to write code for a web scraper, and I finally got around to doing it while at home during lockdown. Here is how I got started.

GitHub site: https://github.com/iamJohnnySam/WebScraping

I came across a helpful article on Towards Data Science and got started learning about the structure of the Daraz Mall page and how to navigate it with Beautiful Soup. Having built some initial confidence from reading a few more web pages, I tried to scrape daraz.lk/daraz-mall and quickly realized I had bitten off far more than I could handle. The scrape itself ran without errors, but the site returned blanks. This was probably because I was using Google's Colab environment to execute my code: a page like this renders its content with JavaScript, so without a real browser session there is nothing to scrape.


I next decided to take this one step at a time.

Installing tools:

The first step is to install Selenium. I installed mine in my Anaconda environment to use with the Spyder editor.
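If you are also using Anaconda, either of the standard install commands should work:

conda install -c conda-forge selenium

pip install selenium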


Next, for Selenium to drive your web browser, you need to install the matching web driver. Installation instructions can be found on the Selenium installation page. I decided to use the Edge or Chrome browser, for which the web drivers can be downloaded from Microsoft or Google respectively.

Good tutorials on installing the Chrome driver are easy to find online, and drivers for other browsers can be installed just as easily.
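Once the driver executable is in your working directory (or on your PATH), a quick sanity check confirms that Selenium can find it. This is a minimal sketch assuming the Chrome driver and the Selenium 3 style used in the rest of this post:

from selenium import webdriver

# Assumes chromedriver.exe sits in the working directory or on PATH
driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.daraz.lk/")
print(driver.title)  # prints the page title if the driver is working
driver.quit()  # close the browser window when done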

Next, we can install Beautiful Soup on Anaconda with the instructions at https://anaconda.org/anaconda/beautifulsoup4 or at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-beautiful-soup.
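In short, either of these should do it:

conda install -c anaconda beautifulsoup4

pip install beautifulsoup4

Since the code below parses with "lxml", make sure that package is present as well (pip install lxml); Anaconda distributions usually include it already.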

Writing Code:

Here I have written a script that simply scrapes the daraz.lk/daraz-mall web page and collects a list of the prices shown on it. Here's the guide I used, in case it helps you too: https://www.edureka.co/blog/web-scraping-with-python/.
 
First, I have imported all the necessary libraries and connected the driver to the required website, as shown below.

from selenium import webdriver
from bs4 import BeautifulSoup as BS

# Launch Chrome through the driver executable and load the Daraz Mall page
driver = webdriver.Chrome("chromedriver.exe")
driver.get("https://www.daraz.lk/daraz-mall/")
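One caveat: Selenium 4 removed the positional driver-path argument, so the line above can raise a TypeError on newer installs. If that happens, wrap the path in a Service object instead; recent releases can even locate the driver automatically, so webdriver.Chrome() with no arguments may also work.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Selenium 4 style: pass the driver path through a Service object
driver = webdriver.Chrome(service=Service("chromedriver.exe"))
driver.get("https://www.daraz.lk/daraz-mall/")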


Next, I have passed the page source I received from the browser to Beautiful Soup in order to filter and navigate the document.

# Hand the fully rendered HTML over to Beautiful Soup for parsing
content = driver.page_source
soup = BS(content, features="lxml")


I have then inspected the website manually and found that the prices on this page are always stored in a <div> tag with the class "store-product-price". I tried this out with the code below to verify my observations.

# Print every matching <div> to confirm the class name is correct
print(soup.find_all("div", class_="store-product-price"))

Finally, I have written a simple for loop that cycles through all of these <div> tags and compiles a list of every price on the page.

# Collect the text of every price <div> into a list
price = []
for val in soup.find_all("div", class_="store-product-price"):
    price.append(val.text)

print(price)
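The same loop can be written more idiomatically as a list comprehension:

# Equivalent one-liner using a list comprehension
price = [val.text for val in soup.find_all("div", class_="store-product-price")]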


This way, it can be seen that all the prices have been extracted into the list I specified.
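If you want the prices as numbers rather than strings, they need a little cleaning first. This is a rough sketch built on an assumption about the format: the price text on the page typically looks something like "Rs. 1,299", so the currency prefix and thousands separators have to be stripped before converting.

# Hypothetical clean-up step: assumes price text like "Rs. 1,299"
numeric_prices = []
for p in price:
    cleaned = p.replace("Rs.", "").replace(",", "").strip()
    try:
        numeric_prices.append(float(cleaned))
    except ValueError:
        pass  # skip anything that is not a plain number

print(numeric_prices)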

You can see the full code on my GitHub.
