Web Scraping with Python: Using the BeautifulSoup Library
Web scraping is the process of extracting data from websites. It involves fetching web pages and extracting the desired information from the HTML or XML content. This can be useful for a variety of purposes, such as collecting data for analysis, monitoring changes on a website, or aggregating information from multiple sources.
BeautifulSoup is a Python library designed for parsing HTML and XML documents. It creates parse trees from page source code that can be used to extract data easily. BeautifulSoup is favored because it's easy to use, flexible, and robust: it can handle poorly formed HTML and still build a tree you can search and navigate.
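To see that robustness in action, here is a small sketch that feeds BeautifulSoup deliberately malformed HTML (unclosed tags) and still extracts data from the resulting tree:

```python
from bs4 import BeautifulSoup

# Deliberately broken HTML: the <li> and <b> tags are never closed
messy_html = "<html><body><ul><li>First item<li>Second <b>bold item</ul></body>"

soup = BeautifulSoup(messy_html, "html.parser")

# BeautifulSoup still builds a searchable parse tree
for li in soup.find_all("li"):
    print(li.get_text(strip=True))
```

Even though the markup is invalid, `find_all` still locates both list items.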
Installing BeautifulSoup
pip install beautifulsoup4 requests
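To confirm the installation worked, you can import both packages and print their versions (a quick sanity check, nothing more):

```python
import bs4
import requests

# If either import fails, the corresponding package is not installed
print("beautifulsoup4:", bs4.__version__)
print("requests:", requests.__version__)
```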
Project: Displaying smart TVs sold in the Kilimall Kenya e-commerce shop
Kilimall is an e-commerce platform based in Kenya that operates an online marketplace. It was founded in 2014 and has become one of the largest e-commerce platforms in East Africa, offering a wide range of products including electronics, fashion, home and garden items, and more. Kilimall connects sellers and buyers across the region and provides services such as secure payments, logistics, and customer support.
Step 1: Import the Libraries
First, import the libraries we'll need: requests to fetch pages, BeautifulSoup to parse them, and pandas for working with the extracted data.
import requests
from bs4 import BeautifulSoup
import pandas as pd
Step 2: Send an HTTP Request
Next, we fetch the HTML content of the page by sending an HTTP GET request with the requests library.
# Define the URL and request headers (a User-Agent header helps avoid basic bot blocking)
url = 'https://www.kilimall.co.ke/search-result?id=2069&form=category&ctgName=TV,Audio&Video'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early if the request failed
Step 3: Parse the HTML Content
With the HTML content retrieved, we use BeautifulSoup to parse it. BeautifulSoup helps us navigate and search the document.
soup = BeautifulSoup(response.text, 'html.parser')
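As a quick illustration of what the parsed soup object gives you, here is a self-contained sketch against a stand-in HTML snippet (the class names mirror the ones we use below, but the real page structure should always be verified in your browser):

```python
from bs4 import BeautifulSoup

# Stand-in HTML, not the real Kilimall page
html = """
<html><head><title>Sample Shop</title></head>
<body>
  <div class="listing-item"><p class="product-title">43-inch Smart TV</p></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

print(soup.title.get_text())  # → Sample Shop
print(soup.find("p", class_="product-title").get_text(strip=True))  # → 43-inch Smart TV
```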
Step 4: Extract the Data
We'll use BeautifulSoup's find_all method to extract each product listing. Inspect the page using your browser's developer tools to find the HTML structure and class names.
# The data we are extracting: product title, price, number of reviews and logistics tags
listings = soup.find_all('div', class_='listing-item')
for item in listings:
    title = item.find('p', class_='product-title').get_text(strip=True)
    price = item.find('div', class_='product-price').get_text(strip=True)
    reviews = item.find('span', class_='reviews').get_text(strip=True) if item.find('span', class_='reviews') else 'No reviews'
    logistics = item.find('div', class_='logistics-tag').get_text(strip=True) if item.find('div', class_='logistics-tag') else 'No logistics info'
    print(title, price, reviews, logistics)
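Since we imported pandas earlier, the extracted fields can also be collected into a DataFrame for analysis. The snippet below runs against a stand-in HTML fragment with invented products, reusing the same class names as above (the live page may differ):

```python
import pandas as pd
from bs4 import BeautifulSoup

# Stand-in HTML with made-up products, mirroring the class names used above
html = """
<div class="listing-item">
  <p class="product-title">32-inch Smart TV</p>
  <div class="product-price">KSh 12,999</div>
  <span class="reviews">(154)</span>
</div>
<div class="listing-item">
  <p class="product-title">43-inch Smart TV</p>
  <div class="product-price">KSh 28,499</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.find_all("div", class_="listing-item"):
    reviews = item.find("span", class_="reviews")
    rows.append({
        "title": item.find("p", class_="product-title").get_text(strip=True),
        "price": item.find("div", class_="product-price").get_text(strip=True),
        "reviews": reviews.get_text(strip=True) if reviews else "No reviews",
    })

df = pd.DataFrame(rows)
print(df)
```

A DataFrame makes it easy to sort by price, count reviews, or export with `df.to_csv()`.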
Step 5: Handle Data Efficiently
When dealing with larger data sets or multiple pages, it's essential to handle data efficiently. Here's how you can modify the script to save the results to a CSV file; the same pattern extends naturally to multiple pages.
import csv

# Define the URL and request headers
url = 'https://www.kilimall.co.ke/search-result?id=2069&form=category&ctgName=TV,Audio&Video'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# Open the CSV file, write a header row, then one row per listing
with open('products.csv', 'w', newline='', encoding='UTF8') as f:
    writer = csv.writer(f)
    writer.writerow(['Product Title', 'Price', 'Reviews', 'Logistics'])
    listings = soup.find_all('div', class_='listing-item')
    for item in listings:
        title = item.find('p', class_='product-title').get_text(strip=True)
        price = item.find('div', class_='product-price').get_text(strip=True)
        reviews = item.find('span', class_='reviews').get_text(strip=True) if item.find('span', class_='reviews') else 'No reviews'
        logistics = item.find('div', class_='logistics-tag').get_text(strip=True) if item.find('div', class_='logistics-tag') else 'No logistics info'
        writer.writerow([title, price, reviews, logistics])
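To actually cover multiple pages, a common pattern is to loop over page numbers. The sketch below assumes the site exposes a page query parameter; that is a hypothetical assumption you should confirm in your browser's address bar before relying on it:

```python
def page_url(base_url: str, page: int) -> str:
    """Build a paginated URL; assumes a 'page' query parameter (hypothetical, verify first)."""
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}page={page}"

base = "https://www.kilimall.co.ke/search-result?id=2069&form=category"
print(page_url(base, 2))

# The fetch loop would then look like this (commented out: it needs network access):
# for page in range(1, 6):
#     response = requests.get(page_url(base, page), timeout=10)
#     soup = BeautifulSoup(response.text, 'html.parser')
#     ...extract listings and writer.writerow(...) as above...
#     time.sleep(1)  # be polite: pause between requests
```

Pausing between requests keeps the load on the server low and reduces the chance of being blocked.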
Conclusion
BeautifulSoup, in conjunction with Python's requests library, offers a robust and straightforward way to collect data from websites. By understanding how to send requests, parse HTML, and extract information, you can automate data collection and gain insights from web content efficiently. Whether you're collecting data for analysis or monitoring web content, BeautifulSoup provides the tools you need to get started. Remember to handle errors gracefully, follow best practices and ethical guidelines, and keep your scraping respectful and compliant with each website's policies.
NB: For educational purposes only.