Module 01 / Project 01: Fetch a Webpage
Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | Walkthrough | — | Flashcards | — | — |
Focus
- `requests.get()` to fetch a URL
- HTTP status codes (200, 404, etc.)
- Inspecting `response.text`, `response.status_code`, and `response.headers`
Why this project exists
Before you can scrape data from any website, you need to know how to fetch a page and understand what comes back. This project teaches you the fundamentals of making HTTP requests in Python. You will see the raw HTML that your browser normally renders for you, and you will learn to check whether a request succeeded or failed.
Run
Expected output
Fetching http://books.toscrape.com/ ...
Status code: 200
Content type: text/html
Content length: 51696 characters
First 500 characters of the page:
--------------------------------------------------
<!DOCTYPE html>
<!--[if lt IE 7]> <html lang="en-us" ...
(HTML content continues)
--------------------------------------------------
Done.
The exact character count and HTML will vary, but you should see status code 200 and recognizable HTML.
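
As a rough illustration only, a script that prints output in this shape might look like the sketch below. The names `build_report` and `main`, the 10-second timeout, and the 500-character preview are assumptions made for this example, not the project's actual code.

```python
import requests

URL = "http://books.toscrape.com/"
SEPARATOR = "-" * 50


def build_report(response, preview_chars=500):
    """Return the report lines for a response-like object (hypothetical helper)."""
    # .get() avoids a KeyError if the server sent no Content-Type header.
    content_type = response.headers.get("Content-Type", "unknown")
    return [
        f"Status code: {response.status_code}",
        f"Content type: {content_type}",
        f"Content length: {len(response.text)} characters",
        f"First {preview_chars} characters of the page:",
        SEPARATOR,
        response.text[:preview_chars],
        SEPARATOR,
        "Done.",
    ]


def main(url=URL):
    print(f"Fetching {url} ...")
    # The timeout is an assumption; without one, a request can wait indefinitely.
    response = requests.get(url, timeout=10)
    print("\n".join(build_report(response)))
```

Calling `main()` performs the real request. Keeping the formatting in `build_report` separate from the network call is a design choice of this sketch: it lets you try the formatting on any object that has `status_code`, `headers`, and `text` attributes.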
Alter it
- Change the URL to `http://books.toscrape.com/catalogue/page-2.html` and run again. What changes? What stays the same?
- Add a line that prints `response.headers` to see all the HTTP headers the server sent back. Pick two headers and look up what they mean.
- Add a check: if the status code is not 200, print a warning message instead of the page content.
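
For the header exercise, it may help to know that `response.headers` behaves like a dictionary whose keys are matched case-insensitively. The sketch below builds a `CaseInsensitiveDict` by hand, with made-up header values, so it runs without a network connection. `format_headers` is a hypothetical helper, not part of the project.

```python
from requests.structures import CaseInsensitiveDict

# Made-up values standing in for what a server might send back.
headers = CaseInsensitiveDict({
    "Content-Type": "text/html",
    "Content-Length": "1024",
})


def format_headers(headers):
    """Return one 'Name: value' line per header (hypothetical helper)."""
    return [f"{name}: {value}" for name, value in headers.items()]


for line in format_headers(headers):
    print(line)

# Lookups ignore case, so both spellings find the same header.
print(headers["content-type"] == headers["Content-Type"])
```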
Break it
- Change the URL to `http://books.toscrape.com/this-page-does-not-exist`. What status code do you get?
- Change the URL to `http://definitely-not-a-real-website-abc123.com`. What error do you get? (Hint: it is not a status code; it is a Python exception.)
- Remove the `import requests` line and run the script. Read the error message carefully.
Fix it
- Wrap the `requests.get()` call in a try/except block that catches `requests.exceptions.RequestException`. Print a friendly error message instead of a traceback.
- After fetching, check `response.status_code`. If it is 404, print "Page not found" and exit early. If it is anything other than 200, print the status code as a warning.
- Put the import back if you removed it.
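
One possible shape for these fixes is sketched below. It is a minimal example under stated assumptions: `fetch_page` is a hypothetical function name, the `get` parameter exists only so the function can be exercised without a network connection, and returning `None` for any non-200 status is a choice made for this sketch rather than a requirement of the project.

```python
import requests


def fetch_page(url, get=requests.get, timeout=10):
    """Return the page text on a 200 response, otherwise None.

    `get` defaults to requests.get; it is injectable (an assumption of this
    sketch) so a stand-in can be passed when no network is available.
    """
    try:
        response = get(url, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # Covers DNS failures, refused connections, timeouts, and similar.
        print(f"Could not fetch {url}: {exc}")
        return None

    if response.status_code == 404:
        print("Page not found")
        return None
    if response.status_code != 200:
        print(f"Warning: unexpected status code {response.status_code}")
        return None
    return response.text
```

A caller can then write `html = fetch_page(url)` and check `if html is None` before using the result, which keeps both kinds of failure out of the main flow.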
Explain it
- What is an HTTP status code and what does 200 mean?
- What is the difference between `response.text` and `response.content`?
- Why might `requests.get()` raise an exception instead of returning a response?
- What does the `Content-Type` header tell you?
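
As a hint for the second question: `response.content` holds the raw bytes of the body, while `response.text` is those bytes decoded into a `str`. The sketch below mimics that relationship with plain Python values rather than a real response. The sample HTML and the UTF-8 encoding are assumptions chosen for illustration.

```python
# Stand-in for response.content: the raw bytes a server might send.
content = "<p>café</p>".encode("utf-8")

# Stand-in for response.text: the same bytes decoded into a string.
text = content.decode("utf-8")

# The byte count and character count differ because "é" takes two bytes in UTF-8.
print(type(content).__name__, len(content))
print(type(text).__name__, len(text))
```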
Mastery check
You can move on when you can:
- Fetch any URL and check whether it succeeded, from memory.
- Explain what a status code is without looking it up.
- Handle both HTTP errors (404) and connection errors (no internet) gracefully.
- Describe what `response.text` contains.