Module 01 / Project 02 — Parse HTML¶

Learn Your Way¶

Read	Build	Watch	Test	Review	Visualize	Try
—	This project	—	—	Flashcards	—	—

Focus¶

Creating a BeautifulSoup object from HTML
find() and find_all() to locate elements
CSS selectors with select()
Extracting text and attributes from elements
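
The four focus items fit in one short sketch. The HTML below is a hypothetical sample written for illustration, not the real page, and the class name price_color is an assumption about the markup. It uses "html.parser" so it runs without installing lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup written for this sketch (not fetched from the site).
SAMPLE_HTML = """
<article class="product_pod">
  <h3><a href="catalogue/book-one/index.html" title="Book One">Book One</a></h3>
  <p class="price_color">£10.00</p>
</article>
<article class="product_pod">
  <h3><a href="catalogue/book-two/index.html" title="Book Two">Book Two</a></h3>
  <p class="price_color">£12.50</p>
</article>
"""

# The second argument names the parser; "html.parser" ships with Python.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# find() returns the first matching element, or None if nothing matches.
first_price = soup.find("p", class_="price_color")

# find_all() returns a list-like collection of every match (empty if none).
all_prices = soup.find_all("p", class_="price_color")

# select() takes a CSS selector and also returns every match.
title_links = soup.select("article.product_pod h3 a")

# Text comes from get_text() (or .text); attributes use dict-style access.
for link, price in zip(title_links, all_prices):
    print(link.get_text(strip=True), link["href"], price.get_text(strip=True))
```

Note that class_ has a trailing underscore because class is a reserved word in Python.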

Why this project exists¶

Raw HTML is a mess of tags, attributes, and nesting. BeautifulSoup turns that mess into a tree structure you can search. This project teaches you to find specific elements on a page, a core skill in web scraping. You will extract book titles and prices from books.toscrape.com, a live site built for scraping practice.

Run¶

cd projects/modules/01-web-scraping/02-parse-html
python project.py

Expected output¶

Fetching http://books.toscrape.com/ ...
Parsing HTML with BeautifulSoup...

Found 20 books on the page:

  1. A Light in the Attic                         £51.77
  2. Tipping the Velvet                            £53.74
  3. Soumission                                    £50.10
  ...
 20. (last book title)                             £XX.XX

Done. Extracted 20 books.

The exact titles and prices depend on the current page content, but you should see 20 books listed.

Alter it¶

Instead of printing the price, print the star rating. Each book has a <p> tag whose class attribute looks like "star-rating Three": two classes, where the second is the rating word. Extract and print that word (One, Two, Three, etc.).
Use soup.select() with a CSS selector instead of find_all(). For example, soup.select("article.product_pod h3 a") selects all title links. Try rewriting the extraction using only CSS selectors.
Extract and print the URL of each book's detail page (the href attribute on the title link).
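
One possible shape for these three changes, as a sketch rather than the project's reference solution. The function name extract_ratings_and_urls and the sample markup are hypothetical; the real page may nest these elements differently.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, simplified for illustration.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/book-one/index.html" title="Book One">Book One</a></h3>
</article>
<article class="product_pod">
  <p class="star-rating Five"></p>
  <h3><a href="catalogue/book-two/index.html" title="Book Two">Book Two</a></h3>
</article>
"""


def extract_ratings_and_urls(html):
    """Return (title, rating word, detail URL) tuples using only CSS selectors."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for article in soup.select("article.product_pod"):
        link = article.select_one("h3 a")
        rating_tag = article.select_one("p.star-rating")
        if link is None or rating_tag is None:
            continue  # skip entries that lack the expected markup
        # "class" is multi-valued, so it comes back as a list such as
        # ["star-rating", "Three"]; keep the part that is not "star-rating".
        words = [c for c in rating_tag.get("class", []) if c != "star-rating"]
        rating = words[0] if words else "Unknown"
        rows.append((link.get_text(strip=True), rating, link.get("href")))
    return rows


for title, rating, url in extract_ratings_and_urls(SAMPLE_HTML):
    print(f"{title:<12} {rating:<6} {url}")
```

If the href values turn out to be relative, urllib.parse.urljoin with the page URL is one way to build absolute links.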

Break it¶

Change the parser from "lxml" to "html.parser" (Python's built-in). Does the output change? What if the HTML were malformed — which parser would handle it better?
Search for a tag that does not exist: soup.find("div", class_="nonexistent"). What does it return? What happens if you try to access .text on that result?
Remove the import for BeautifulSoup and run the script. Read the error.
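
To check your observations from the missing-tag experiment, here is a small self-contained sketch. It uses a one-line hypothetical document and "html.parser" so that it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical one-element document for illustration.
soup = BeautifulSoup("<div class='present'>hello</div>", "html.parser")

# find() returns None when nothing matches.
missing = soup.find("div", class_="nonexistent")
print(missing)

# find_all() returns an empty collection instead of None.
missing_all = soup.find_all("div", class_="nonexistent")
print(len(missing_all))

# Accessing .text on None raises AttributeError, because None has no such attribute.
try:
    missing.text
except AttributeError as exc:
    error_message = str(exc)
    print("AttributeError:", error_message)
```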

Fix it¶

Before accessing .text on a found element, add a check: if element is not None. Print "Not found" if the element is missing.
If the page fetch fails (status code is not 200), skip the parsing step entirely and print an error message.
Restore any imports you removed.
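
The first two fixes might look like the sketch below. The helper names safe_text and parse_titles are hypothetical, and the status code is passed in as a plain integer so the example runs without a network request. In project.py you would take it from your own fetch response.

```python
from bs4 import BeautifulSoup


def safe_text(element, default="Not found"):
    """Return the element's text, or a default when find() returned None."""
    if element is not None:
        return element.get_text(strip=True)
    return default


def parse_titles(status_code, html):
    """Parse title links only when the fetch succeeded (status 200)."""
    if status_code != 200:
        print(f"Error: fetch failed with status {status_code}")
        return []  # skip parsing entirely
    soup = BeautifulSoup(html, "html.parser")
    return [safe_text(a) for a in soup.select("article.product_pod h3 a")]


# Hypothetical sample markup for illustration.
html = '<article class="product_pod"><h3><a href="x.html">Book One</a></h3></article>'

print(parse_titles(200, html))
print(parse_titles(404, html))
print(safe_text(BeautifulSoup(html, "html.parser").find("div", class_="nonexistent")))
```

Returning an empty list on failure keeps the caller's loop simple. Raising an exception instead would be an equally reasonable design.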

Explain it¶

What does BeautifulSoup(html, "lxml") do? What is the second argument for?
What is the difference between find() and find_all()?
How do you get the text content of a tag? How do you get an attribute like href?
What is a CSS selector and why might you prefer select() over find_all()?

Mastery check¶

You can move on when you can:

Parse any HTML string with BeautifulSoup without looking up the syntax.
Find elements by tag name, class, and CSS selector.
Extract both text content and attributes from elements.
Handle the case where an element is not found on the page.

Next¶

Project 03 — Extract Structured Data