Module 01 - Web Scraping
Overview
This module teaches you how to fetch web pages, parse HTML, extract structured data, handle pagination, and save results to CSV files. You will use real libraries (requests, BeautifulSoup) against a real website designed for scraping practice.
Every project targets books.toscrape.com, a safe sandbox site that exists specifically for people learning web scraping. You will not get in trouble for scraping it.
Prerequisites
Complete Level 2 before starting this module. You should be comfortable with:
- Functions and return values
- Reading and writing files
- Dictionaries and lists
- Basic testing with pytest
- Running scripts from the command line
Learning objectives
By the end of this module you will be able to:
- Fetch a web page with `requests` and inspect the response.
- Parse HTML with BeautifulSoup and extract elements using tags, classes, and CSS selectors.
- Build structured data (list of dicts) from scraped content.
- Follow pagination links to scrape multiple pages with rate limiting.
- Write scraped data to a CSV file using `csv.DictWriter`.
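The last objective can be sketched with the standard library alone. This is a minimal illustration, not the module's reference solution: the function name `write_books_csv`, the column names (`title`, `price`, `rating`), and the sample rows are all made up for the example, and deduplicating on the title is an assumption about what counts as a duplicate.

```python
import csv
import io


def write_books_csv(books, file_obj):
    """Write a list of dicts to file_obj as CSV, skipping repeated titles.

    Returns the number of data rows written.
    """
    fieldnames = ["title", "price", "rating"]  # hypothetical columns
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()

    seen_titles = set()
    rows_written = 0
    for book in books:
        if book["title"] in seen_titles:
            continue  # already written, skip the duplicate
        seen_titles.add(book["title"])
        writer.writerow(book)
        rows_written += 1
    return rows_written


# Made-up sample data; real rows would come from the scraper.
sample_books = [
    {"title": "Example Book A", "price": "10.00", "rating": 3},
    {"title": "Example Book B", "price": "12.50", "rating": 5},
    {"title": "Example Book A", "price": "10.00", "rating": 3},
]

buffer = io.StringIO()
write_books_csv(sample_books, buffer)
print(buffer.getvalue())
```

To write to disk instead of an in-memory buffer, open the target with `open("books.csv", "w", newline="", encoding="utf-8")` and pass that file object; `newline=""` is what the `csv` module documentation recommends to avoid blank lines on Windows.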
Projects
| # | Project | What you learn |
|---|---|---|
| 01 | Fetch a Webpage | requests.get(), status codes, response body |
| 02 | Parse HTML | BeautifulSoup, find(), find_all(), CSS selectors |
| 03 | Extract Structured Data | Scraping tables, building list of dicts, star ratings |
| 04 | Multi-Page Scraper | Pagination, time.sleep(), collecting across pages |
| 05 | Save to CSV | csv.DictWriter, deduplication, file output |
Work through them in order. Each project builds on the previous one.
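As a preview of the pagination pattern in project 04, the loop below follows "next" links and pauses between requests. It is a sketch under stated assumptions: `scrape_all_pages` and the `fetch_page` callback are hypothetical names, and `fetch_page` is assumed to return a tuple of (items found on the page, the href of the next page or `None`). In the module that callback would wrap `requests` and BeautifulSoup; here it is left abstract so the loop itself is easy to read.

```python
import time
from urllib.parse import urljoin


def scrape_all_pages(start_url, fetch_page, delay_seconds=1.0, max_pages=50):
    """Collect items from start_url and every page reachable via next links.

    fetch_page(url) is a hypothetical callback returning (items, next_href).
    max_pages is a safety cap so a bad link cannot cause an endless loop.
    """
    results = []
    url = start_url
    pages_seen = 0

    while url and pages_seen < max_pages:
        items, next_href = fetch_page(url)
        results.extend(items)
        pages_seen += 1

        if not next_href:
            break  # no next link, this was the last page

        # Next links are often relative, so resolve against the current URL.
        url = urljoin(url, next_href)
        time.sleep(delay_seconds)  # rate limiting between requests

    return results
```

Passing the fetching logic in as a function keeps the loop testable without network access: a test can supply a fake `fetch_page` that returns canned pages and set `delay_seconds=0`.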
Setup
Create a virtual environment and install dependencies before starting:
cd projects/modules/01-web-scraping
python -m venv .venv
source .venv/bin/activate # macOS/Linux
.venv\Scripts\activate # Windows
pip install -r requirements.txt
See concepts/virtual-environments.md for a full explanation of virtual environments.
Dependencies
This module requires three packages (listed in requirements.txt):
- requests: makes HTTP requests simple. You call `requests.get(url)` and get a response object back.
- beautifulsoup4: parses HTML into a tree you can search. The import name is `bs4`.
- lxml: a fast HTML/XML parser that BeautifulSoup can use as its parsing backend when you pass `"lxml"` as the parser name.
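A minimal sketch of how these packages fit together is shown below. The helper names `extract_titles` and `fetch_titles` are hypothetical, and the CSS selector `article.product_pod h3 a[title]` is an assumption about the markup of the practice site, so check it against the page source before relying on it. The parser defaults to Python's built-in `"html.parser"` so the example runs without lxml; pass `parser="lxml"` to use the parser listed above.

```python
import requests
from bs4 import BeautifulSoup


def extract_titles(html, parser="html.parser"):
    """Return the title attribute of each book link found in the HTML."""
    soup = BeautifulSoup(html, parser)
    # Assumed markup: <article class="product_pod"><h3><a title="...">
    links = soup.select("article.product_pod h3 a[title]")
    return [link["title"] for link in links]


def fetch_titles(url, timeout=10):
    """Download a page and extract book titles from it."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx status codes
    # Pass the raw bytes so BeautifulSoup can detect the encoding itself.
    return extract_titles(response.content)
```

Calling `fetch_titles("https://books.toscrape.com/")` would then return a list of strings, provided the selector matches the live page. Keeping the parsing in its own function means it can be exercised with a small HTML string, without any network access.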
A note on web scraping ethics
Web scraping is a powerful tool, but it comes with responsibilities:
- Always check a site's `robots.txt` before scraping (e.g., `http://example.com/robots.txt`).
- Respect rate limits. Add delays between requests so you do not overload servers.
- Do not scrape personal data or content behind login walls without permission.
- The site we use in this module (books.toscrape.com) is explicitly designed for scraping practice.
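The `robots.txt` check in the first point can be automated with the standard library's `urllib.robotparser`. The sketch below is illustrative only: `is_allowed` is a hypothetical helper, the user agent string `my-scraper` is a placeholder, and the rules shown are made-up sample content rather than any real site's policy.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration.
sample_rules = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(sample_rules, "my-scraper", "http://example.com/catalogue/"))
print(is_allowed(sample_rules, "my-scraper", "http://example.com/private/data.html"))
```

Against a live site you would normally call `parser.set_url(...)` followed by `parser.read()` to download the file instead of passing the text in. Parsing a string, as done here, keeps the example runnable offline.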