# Module 01 / Project 02 — Parse HTML

## Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | — |
## Focus

- Creating a `BeautifulSoup` object from HTML
- `find()` and `find_all()` to locate elements
- CSS selectors with `select()`
- Extracting text and attributes from elements
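The four skills above can be sketched in a few lines. This is a minimal, self-contained example; it assumes `beautifulsoup4` is installed, and the HTML snippet is modeled on the books.toscrape.com markup rather than fetched from the live page.

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

# A small snippet shaped like a books.toscrape.com product card
# (illustrative assumption, not the live page's exact markup).
html = """
<article class="product_pod">
  <h3><a href="catalogue/a-light-in-the-attic_1000/index.html"
         title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
</article>
"""

# Create the soup object; "html.parser" is Python's built-in parser.
soup = BeautifulSoup(html, "html.parser")

# find() returns the first match (or None); find_all() returns a list.
article = soup.find("article", class_="product_pod")
title_link = article.find("a")

# select() takes a CSS selector and returns a list of matching tags.
prices = soup.select("p.price_color")

print(title_link["title"])  # attribute access by key
print(prices[0].text)       # text content of a tag
```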
## Why this project exists
Raw HTML is a mess of tags, attributes, and nesting. BeautifulSoup turns that mess into a tree structure you can search. This project teaches you to find specific elements on a page — the single most important skill in web scraping. You will extract book titles and prices from a real webpage.
## Run
### Expected output

```
Fetching http://books.toscrape.com/ ...
Parsing HTML with BeautifulSoup...
Found 20 books on the page:
  1. A Light in the Attic — £51.77
  2. Tipping the Velvet — £53.74
  3. Soumission — £50.10
...
 20. (last book title) £XX.XX
Done. Extracted 20 books.
```
The exact titles and prices depend on the current page content, but you should see 20 books listed.
## Alter it

- Instead of printing the price, print the star rating. Each book has a `<p>` tag with a class like `star-rating Three`. Extract and print the rating word (One, Two, Three, etc.).
- Use `soup.select()` with a CSS selector instead of `find_all()`. For example, `soup.select("article.product_pod h3 a")` selects all title links. Try rewriting the extraction using only CSS selectors.
- Extract and print the URL of each book's detail page (the `href` attribute on the title link).
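The three alterations above all fit in one short sketch. It works on a hardcoded snippet shaped like a books.toscrape.com product card (an assumption, since the live markup may differ) and assumes `beautifulsoup4` is installed.

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

# Markup modeled on a books.toscrape.com product card (assumption).
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="catalogue/soumission_998/index.html">Soumission</a></h3>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup parses multi-valued class attributes into a list,
# so the rating word is the second entry: ["star-rating", "Three"].
rating_tag = soup.find("p", class_="star-rating")
rating = rating_tag["class"][1]

# CSS-selector version of the title extraction.
links = soup.select("article.product_pod h3 a")

print(rating)              # rating word
print(links[0].text)       # title text
print(links[0]["href"])    # detail-page URL from the href attribute
```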
## Break it

- Change the parser from `"lxml"` to `"html.parser"` (Python's built-in). Does the output change? What if the HTML were malformed — which parser would handle it better?
- Search for a tag that does not exist: `soup.find("div", class_="nonexistent")`. What does it return? What happens if you try to call `.text` on that result?
- Remove the `import` for BeautifulSoup and run the script. Read the error.
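The second breakage is worth seeing in isolation: `find()` returns `None` when nothing matches, and `None` has no `.text`, so the follow-up call raises `AttributeError`. A minimal demonstration (assumes `beautifulsoup4` is installed):

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

soup = BeautifulSoup("<div class='real'>here</div>", "html.parser")

# No <div class="nonexistent"> exists, so find() returns None.
missing = soup.find("div", class_="nonexistent")
print(missing)  # None

try:
    missing.text  # None has no .text attribute
except AttributeError as exc:
    print("AttributeError:", exc)
```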
## Fix it

- Before calling `.text` on a found element, add a check: `if element is not None`. Print "Not found" if the element is missing.
- If the page fetch fails (status code is not 200), skip the parsing step entirely and print an error message.
- Restore any imports you removed.
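One way the first two fixes might look, as a sketch: a small helper (the name `safe_text` is my invention, not part of the project) that guards the `.text` access, plus the fetch guard shown as comments since it needs a live network call.

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

soup = BeautifulSoup("<p class='price_color'>£51.77</p>", "html.parser")

def safe_text(element):
    """Return an element's text, or a fallback if it was not found."""
    if element is not None:
        return element.text
    return "Not found"

print(safe_text(soup.find("p", class_="price_color")))  # £51.77
print(safe_text(soup.find("p", class_="star-rating")))  # Not found

# Fetch guard (sketch, assuming `requests` and a `url` variable):
# response = requests.get(url)
# if response.status_code != 200:
#     print(f"Error: fetch failed with status {response.status_code}")
# else:
#     soup = BeautifulSoup(response.text, "html.parser")
```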
## Explain it

- What does `BeautifulSoup(html, "lxml")` do? What is the second argument for?
- What is the difference between `find()` and `find_all()`?
- How do you get the text content of a tag? How do you get an attribute like `href`?
- What is a CSS selector, and why might you prefer `select()` over `find_all()`?
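To check your answers against running code, here is a compact comparison (assumes `beautifulsoup4` is installed; the `<ul>` snippet is made up for illustration):

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4

html = "<ul><li class='item'><a href='/a'>A</a></li><li class='item'>B</li></ul>"
soup = BeautifulSoup(html, "html.parser")  # 2nd arg picks the parser backend

first = soup.find("li")        # first matching tag, or None
every = soup.find_all("li")    # list of ALL matching tags
same = soup.select("li.item")  # CSS selector; also returns a list

print(first.text, len(every), len(same))

# Two ways to read an attribute: indexing raises KeyError if the
# attribute is missing, while .get() returns None instead.
link = soup.find("a")
print(link["href"], link.get("id"))
```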
## Mastery check
You can move on when you can:
- Parse any HTML string with BeautifulSoup without looking up the syntax.
- Find elements by tag name, class, and CSS selector.
- Extract both text content and attributes from elements.
- Handle the case where an element is not found on the page.