Module 01 / Project 03 — Extract Structured Data¶
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | — |
Focus¶
- Scraping multiple fields per item
- Building a list of dictionaries from scraped data
- Mapping CSS classes to meaningful values (star ratings)
- Printing formatted tabular output
Why this project exists¶
Scraping one field at a time is useful for learning, but real scraping tasks usually require extracting several fields per item and organizing them into structured data. This project teaches you to build a list of dictionaries, a common Python structure for tabular data, from a scraped page. You will extract the title, price, rating, and availability of every book on the page.
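As a rough illustration of the idea, the sketch below builds a list of dictionaries from a small hand-written HTML sample instead of the live site. The tag and class names in the sample (`product_pod`, `star-rating`, `price_color`, `availability`), the `RATING_WORDS` mapping, and the `extract_books` helper are assumptions made for this example; check them against the real page before relying on them. It also assumes the `beautifulsoup4` package is installed.

```python
from bs4 import BeautifulSoup

# Hand-written sample in the shape ASSUMED for a book listing; the real
# page markup may differ, so verify the selectors against the live HTML.
SAMPLE_HTML = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="book-1.html" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability"> In stock </p>
</article>
<article class="product_pod">
  <p class="star-rating One"></p>
  <h3><a href="book-2.html" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="price_color">£53.74</p>
  <p class="instock availability"> In stock </p>
</article>
"""

# Rating class name -> numeric value (assumed naming scheme).
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def extract_books(html):
    """Return one dictionary per <article>, with four fields each."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for article in soup.find_all("article"):
        # The rating is stored as a second CSS class, e.g. "star-rating Three".
        rating_classes = article.find("p", class_="star-rating")["class"]
        rating_word = next(c for c in rating_classes if c != "star-rating")
        books.append({
            "title": article.h3.a["title"],
            "price": article.find("p", class_="price_color").get_text(strip=True),
            "rating": RATING_WORDS[rating_word],
            "available": article.find("p", class_="availability").get_text(strip=True),
        })
    return books


books = extract_books(SAMPLE_HTML)
```

Each dictionary is one row and each key is one column, so `books[0]["title"]` reads much like a cell lookup in a table.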
Run¶
Expected output¶
Fetching http://books.toscrape.com/ ...
Parsing page and extracting book data...
#   Title                                    Price    Rating  Available
--- ---------------------------------------- -------- ------- ---------
1   A Light in the Attic                     £51.77   3 star  In stock
2   Tipping the Velvet                       £53.74   1 star  In stock
3   Soumission                               £50.10   1 star  In stock
...
20  (last title)                             £XX.XX   X star  In stock
Extracted 20 books with 4 fields each.
Done.
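One possible way to produce a table in this shape is sketched below. The column widths are read off the dashed separator line in the expected output; the left alignment, the truncation of long titles, and the names `WIDTHS`, `format_row`, and `format_table` are assumptions for this example, not the project's actual code.

```python
# Column widths taken from the dashed separator line in the expected output.
WIDTHS = (3, 40, 8, 7, 9)
HEADERS = ("#", "Title", "Price", "Rating", "Available")


def format_row(cells):
    """Left-align each cell in its column, cutting off text that is too long."""
    padded = (str(cell)[:width].ljust(width) for cell, width in zip(cells, WIDTHS))
    return " ".join(padded).rstrip()


def format_table(books):
    """Return the table as a list of lines: header, separator, one row per book."""
    lines = [format_row(HEADERS), " ".join("-" * width for width in WIDTHS)]
    for number, book in enumerate(books, start=1):
        lines.append(format_row((
            number,
            book["title"],
            book["price"],
            f"{book['rating']} star",
            book["available"],
        )))
    return lines


sample = [
    {"title": "A Light in the Attic", "price": "£51.77", "rating": 3, "available": "In stock"},
    {"title": "Tipping the Velvet", "price": "£53.74", "rating": 1, "available": "In stock"},
]
table = format_table(sample)

if __name__ == "__main__":
    print("\n".join(table))
```

Returning the lines instead of printing them inside the function keeps the formatting easy to check separately from the scraping.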
Alter it¶
- Add a fifth field: the book's detail page URL (the `href` attribute). Include it in each dictionary and print it as an extra column.
- Filter the output to only show books rated 4 or 5 stars. Print how many books were filtered out.
- Sort the books by price (lowest first) before printing. You will need to convert the price string to a float — strip the pound sign first.
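The filtering and sorting tasks above could be approached along these lines. The records are illustrative placeholders, not values from the real site, and `price_as_float` is a hypothetical helper name.

```python
# Illustrative records only; these are NOT values from the real site.
books = [
    {"title": "Book A", "price": "£51.77", "rating": 3},
    {"title": "Book B", "price": "£22.60", "rating": 4},
    {"title": "Book C", "price": "£47.82", "rating": 5},
    {"title": "Book D", "price": "£13.99", "rating": 1},
]


def price_as_float(book):
    """Strip the pound sign, then convert the remaining text to a float."""
    return float(book["price"].replace("£", ""))


# Keep only books rated 4 or 5 stars and count how many were dropped.
top_rated = [book for book in books if book["rating"] >= 4]
filtered_out = len(books) - len(top_rated)

# sorted() returns a new list, so the original order is left untouched.
by_price = sorted(books, key=price_as_float)
```

Sorting on the converted number matters: sorting the raw strings would compare them character by character, not by numeric value.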
Break it¶
- Change the rating mapping so it is missing one entry (e.g., remove "Three"). What happens when a book with that rating is processed?
- Try to convert a price string to a float without removing the pound sign. What error do you get?
- Change `find_all("article")` to `find_all("div")`. Does it still work? Why or why not?
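If you want to observe the first two failures without touching the scraper, a small experiment like the following may help. It catches each exception so the script keeps running; the variable names are made up for this sketch.

```python
# A rating mapping with the "Three" entry deliberately removed.
incomplete_ratings = {"One": 1, "Two": 2, "Four": 4, "Five": 5}

missing_key_error = None
try:
    incomplete_ratings["Three"]
except KeyError as error:
    # Indexing a dict with a missing key raises KeyError.
    missing_key_error = error

conversion_error = None
try:
    float("£51.77")
except ValueError as error:
    # float() rejects text that contains a currency symbol.
    conversion_error = error

if __name__ == "__main__":
    print(type(missing_key_error).__name__, missing_key_error)
    print(type(conversion_error).__name__, conversion_error)
```

In the real script these exceptions are not caught, so the first affected book would stop the whole run.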
Fix it¶
- Add a fallback for unknown ratings: if the rating class is not in your mapping, set it to 0 instead of crashing.
- Use `price_text.replace("£", "")` or `price_text[1:]` to strip the currency symbol before converting to float.
- Add a check: if `find_all()` returns an empty list, print a warning that the page structure may have changed.
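The three fixes above might look something like this. The helper names (`rating_from_class`, `parse_price`, `warn_if_empty`) and the `RATING_WORDS` mapping are assumptions for this sketch, and `warn_if_empty` takes a plain list so it can be tried without fetching a page.

```python
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def rating_from_class(word):
    """Map a rating class name to a number, falling back to 0 if unknown."""
    return RATING_WORDS.get(word, 0)


def parse_price(price_text):
    """Remove the pound sign, then convert the remaining text to a float."""
    return float(price_text.replace("£", ""))


def warn_if_empty(items):
    """Print a warning and return True when no items were found."""
    if not items:
        print("Warning: no items found; the page structure may have changed.")
        return True
    return False
```

Of the two stripping options, `replace("£", "")` also works when the symbol is absent, while `price_text[1:]` assumes the symbol is always exactly the first character and would otherwise drop a digit.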
Explain it¶
- Why is a list of dictionaries a good data structure for scraped data?
- How did you map CSS class names (like "Three") to numeric values?
- What would happen if the website changed its HTML structure? How would you detect that?
- Why is it important to strip non-numeric characters before converting strings to numbers?
Mastery check¶
You can move on when you can:
- Scrape multiple fields from each item on a page and store them as dicts.
- Map CSS classes or HTML attributes to meaningful values.
- Handle missing or unexpected values without crashing.
- Print structured data in a readable table format.