Module 01 — Web Scraping

Overview

This module teaches you how to fetch web pages, parse HTML, extract structured data, handle pagination, and save results to CSV files. You will use real libraries (requests, BeautifulSoup) against a real website designed for scraping practice.

Every project targets books.toscrape.com, a safe sandbox site that exists specifically for people learning web scraping. You will not get in trouble for scraping it.

Prerequisites

Complete Level 2 before starting this module. You should be comfortable with:

Functions and return values
Reading and writing files
Dictionaries and lists
Basic testing with pytest
Running scripts from the command line

Learning objectives

By the end of this module you will be able to:

Fetch a web page with requests and inspect the response.
Parse HTML with BeautifulSoup and extract elements using tags, classes, and CSS selectors.
Build structured data (a list of dicts) from scraped content.
Follow pagination links to scrape multiple pages with rate limiting.
Write scraped data to a CSV file using csv.DictWriter.
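
The pagination objective can be sketched without touching the network. This is a minimal illustration, not the module's solution: `crawl_pages`, `fetch_page`, `fake_fetch`, and the `(items, next_href)` return shape are hypothetical names chosen for this example, and the URLs and book titles are made up. A real `fetch_page` would download the page with requests and pull the "next" link out with BeautifulSoup.

```python
import time
from urllib.parse import urljoin


def crawl_pages(start_url, fetch_page, delay_seconds=1.0, max_pages=50):
    """Follow 'next' links from start_url, pausing between requests.

    fetch_page is a hypothetical callable: it takes a URL and returns
    (items, next_href), where next_href is None on the last page.
    """
    all_items = []
    url = start_url
    pages_seen = 0

    while url is not None and pages_seen < max_pages:
        items, next_href = fetch_page(url)
        all_items.extend(items)
        pages_seen += 1

        if next_href is None:
            break  # last page reached

        # Pagination links are often relative, so resolve against the current URL.
        url = urljoin(url, next_href)
        time.sleep(delay_seconds)  # rate limiting: wait before the next request

    return all_items


# Stand-in for a real fetcher so the sketch runs offline (made-up data).
FAKE_SITE = {
    "https://example.com/catalogue/page-1.html": (["Book A", "Book B"], "page-2.html"),
    "https://example.com/catalogue/page-2.html": (["Book C"], None),
}


def fake_fetch(url):
    return FAKE_SITE[url]


titles = crawl_pages(
    "https://example.com/catalogue/page-1.html", fake_fetch, delay_seconds=0
)
print(titles)
```

The `max_pages` cap is a design choice in this sketch: it keeps a bug in the "next" link logic from turning into an endless crawl.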

Projects

#	Project	What you learn
01	Fetch a Webpage	`requests.get()`, status codes, response body
02	Parse HTML	BeautifulSoup, `find()`, `find_all()`, CSS selectors
03	Extract Structured Data	Scraping tables, building a list of dicts, star ratings
04	Multi-Page Scraper	Pagination, `time.sleep()`, collecting across pages
05	Save to CSV	`csv.DictWriter`, deduplication, file output

Work through them in order. Each project builds on the previous one.
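
To give a feel for where the projects end up, here is a rough sketch of the last step: writing a list of dicts to CSV with `csv.DictWriter` while skipping duplicates. The function name `write_books_csv`, the field names, and the sample rows are assumptions made for illustration; the actual project may structure this differently.

```python
import csv
import io


def write_books_csv(books, out_file, key="title"):
    """Write a list of dicts to out_file as CSV, skipping repeated keys.

    Returns the number of data rows written (the header is not counted).
    """
    if not books:
        return 0

    # Use the first row's keys as the column order.
    fieldnames = list(books[0].keys())
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()

    seen = set()
    written = 0
    for book in books:
        if book[key] in seen:
            continue  # duplicate: a row with this key was already written
        seen.add(book[key])
        writer.writerow(book)
        written += 1
    return written


# Made-up sample rows; the field names are assumptions for illustration.
sample_books = [
    {"title": "Book A", "price": "10.00", "rating": 3},
    {"title": "Book B", "price": "12.50", "rating": 5},
    {"title": "Book A", "price": "10.00", "rating": 3},
]

buffer = io.StringIO()  # in-memory stand-in for a real file
count = write_books_csv(sample_books, buffer)
print(count)
print(buffer.getvalue())
```

To write an actual file, pass `open("books.csv", "w", newline="", encoding="utf-8")` instead of the `StringIO` buffer. The `newline=""` argument is what the csv module documentation recommends so that line endings are not doubled on Windows.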

Setup

Create a virtual environment and install dependencies before starting:

cd projects/modules/01-web-scraping
python -m venv .venv
# Activate the environment: run only the line that matches your OS.
source .venv/bin/activate    # macOS/Linux
.venv\Scripts\activate       # Windows
pip install -r requirements.txt

See concepts/virtual-environments.md for a full explanation of virtual environments.
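
One way to confirm the install worked is to ask Python which versions it can see. The following is a small sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is hypothetical, and the package names are the ones this module lists in requirements.txt.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in package_names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


for name, ver in installed_versions(["requests", "beautifulsoup4", "lxml"]).items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

If any line prints NOT INSTALLED, the virtual environment is probably not activated in the current terminal.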

Dependencies

This module requires three packages (listed in requirements.txt):

requests — makes HTTP requests simple. You call requests.get(url) and get a response object back.
beautifulsoup4 — parses HTML into a tree you can search. The import name is bs4.
lxml — a fast HTML/XML parser that BeautifulSoup can use as its parsing backend. You select it by passing "lxml" as the second argument to BeautifulSoup().
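
As a minimal sketch of how the three packages fit together: requests downloads the page, and BeautifulSoup parses it with lxml as the backend. The function names `fetch_html` and `extract_titles` are hypothetical, and the `h3 a` selector matches the made-up sample HTML below, not necessarily the markup of any real site.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text


def extract_titles(html, parser="lxml"):
    """Return the text of every link nested inside an <h3> element."""
    soup = BeautifulSoup(html, parser)
    return [link.get_text(strip=True) for link in soup.select("h3 a")]


# Inline HTML (made up for this sketch) so the parsing step runs offline.
SAMPLE_HTML = """
<html><body>
  <article><h3><a href="a.html">Book A</a></h3></article>
  <article><h3><a href="b.html">Book B</a></h3></article>
</body></html>
"""

print(extract_titles(SAMPLE_HTML))
```

Against a live page, the two steps chain together as `extract_titles(fetch_html(url))`. Keeping the download and the parsing in separate functions means the parsing logic can be tested on saved HTML without making any requests.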

A note on web scraping ethics

Web scraping is a powerful tool, but it comes with responsibilities:

Always check a site's robots.txt before scraping (e.g., http://example.com/robots.txt).
Respect rate limits. Add delays between requests so you do not overload servers.
Do not scrape personal data or content behind login walls without permission.
The site we use in this module (books.toscrape.com) is explicitly designed for scraping practice.
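
The robots.txt check can be automated with the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt from a string so that it runs offline; the helper name `build_robot_rules`, the user agent string "my-scraper", and the sample rules are all assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def build_robot_rules(robots_txt_text):
    """Parse the text of a robots.txt file into a RobotFileParser."""
    rules = RobotFileParser()
    rules.parse(robots_txt_text.splitlines())
    return rules


# Made-up robots.txt content so the sketch runs offline.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rules = build_robot_rules(SAMPLE_ROBOTS)

# Ask whether our (hypothetical) user agent may fetch each URL.
print(rules.can_fetch("my-scraper", "http://example.com/catalogue/page-1.html"))
print(rules.can_fetch("my-scraper", "http://example.com/private/data.html"))

# Crawl-delay, if the site declares one, suggests how long to sleep between requests.
print(rules.crawl_delay("my-scraper"))
```

For a live site, call `rules.set_url("http://example.com/robots.txt")` followed by `rules.read()` instead of `parse()`, and let the parser download the file itself.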