# Web Scraping Pipeline — Diagrams
## Overview
These diagrams trace the full web scraping workflow from fetching a page to storing structured data, including politeness strategies, error handling, and the parse-extract-transform pipeline.
## End-to-End Scraping Pipeline
A well-structured scraper separates fetching, parsing, extraction, transformation, and storage into distinct stages. Each stage has a clear responsibility and failure mode.
```mermaid
flowchart TD
URL["Target URL<br/>https://example.com/products"] --> FETCH["Fetch<br/>requests.get(url, headers, timeout)"]
FETCH --> CHECK{"Status OK?"}
CHECK -->|"200"| PARSE["Parse HTML<br/>BeautifulSoup(html, 'html.parser')"]
CHECK -->|"403/429"| RETRY["Rate Limited or Blocked<br/>Wait, rotate headers, retry"]
CHECK -->|"404/500"| LOG["Log Error<br/>Skip URL, continue"]
RETRY --> FETCH
PARSE --> EXTRACT["Extract Data<br/>soup.select('div.product')<br/>CSS selectors or find_all()"]
EXTRACT --> TRANSFORM["Transform & Clean<br/>Strip whitespace<br/>Convert price to float<br/>Normalize dates"]
TRANSFORM --> VALIDATE{"Data valid?"}
VALIDATE -->|"Yes"| STORE["Store<br/>CSV, JSON, SQLite, or API"]
VALIDATE -->|"No"| DISCARD["Log & Discard<br/>Missing required fields"]
STORE --> NEXT{"More pages?"}
NEXT -->|"Yes"| DELAY["Polite Delay<br/>time.sleep(1-3)"]
DELAY --> URL
NEXT -->|"No"| DONE["Done<br/>Report: N items scraped"]
style FETCH fill:#4a9eff,stroke:#2670c2,color:#fff
style PARSE fill:#cc5de8,stroke:#9c36b5,color:#fff
style EXTRACT fill:#ff922b,stroke:#e8590c,color:#fff
style TRANSFORM fill:#ffd43b,stroke:#f59f00,color:#000
style STORE fill:#51cf66,stroke:#27ae60,color:#fff
style RETRY fill:#ff6b6b,stroke:#c92a2a,color:#fff
```
Key points:
- Always set a timeout on requests to avoid hanging forever on unresponsive servers
- Separate extraction (finding elements) from transformation (cleaning data) for clearer code
- Validate extracted data before storing: catch missing fields early
- Add delays between requests to avoid overwhelming the target server
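Below is a minimal sketch of this pipeline in Python. It assumes `requests` and `beautifulsoup4` are installed and that each product sits in a `div.product` containing an `h2` title and a `.price` element; the URL, selectors, field names, and backoff values are illustrative, not taken from a real site.

```python
import csv
import random
import time

import requests
from bs4 import BeautifulSoup

HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; example-scraper/0.1)"}


def fetch(url, retries=3):
    """Fetch a page, backing off and retrying when rate limited or blocked."""
    for attempt in range(retries):
        resp = requests.get(url, headers=HEADERS, timeout=10)  # always set a timeout
        if resp.status_code == 200:
            return resp.text
        if resp.status_code in (403, 429):
            wait = 2 ** attempt * 5                 # back off harder on each retry
            print(f"Blocked or rate limited ({resp.status_code}), waiting {wait}s")
            time.sleep(wait)
            continue
        print(f"Skipping {url}: HTTP {resp.status_code}")  # 404/500: log and move on
        return None
    return None


def extract(html):
    """Extraction: locate elements. The selectors are assumptions about the page."""
    soup = BeautifulSoup(html, "html.parser")
    for product in soup.select("div.product"):
        name_el = product.select_one("h2")
        price_el = product.select_one(".price")
        yield {
            "name": name_el.get_text(strip=True) if name_el else "",
            "price": price_el.get_text(strip=True) if price_el else "",
        }


def transform(raw):
    """Transformation: clean values. Returns None if a required field is missing."""
    if not raw["name"] or not raw["price"]:
        return None
    try:
        price = float(raw["price"].replace("$", "").replace(",", ""))
    except ValueError:
        return None
    return {"name": raw["name"], "price": price}


def scrape(urls, out_path="products.csv"):
    items = []
    for url in urls:
        html = fetch(url)
        if html is None:
            continue
        for raw in extract(html):
            clean = transform(raw)
            if clean is not None:
                items.append(clean)
        time.sleep(random.uniform(1, 3))            # polite delay between pages
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(items)
    print(f"Done: {len(items)} items scraped")


# scrape(["https://example.com/products?page=1", "https://example.com/products?page=2"])
```

Keeping `fetch`, `extract`, `transform`, and the storage step as separate functions mirrors the diagram: each stage can be tested on its own, and a failure in one stage (a blocked request, a missing field) never silently corrupts the others.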
## Politeness and Rate Limiting Strategy
Responsible scraping means respecting the server. These strategies reduce the chance of your scraper being blocked and keep you on the right side of a site's policies.
```mermaid
flowchart TD
START["Before Scraping"] --> ROBOTS["Check robots.txt<br/>Respect Disallow rules"]
ROBOTS --> HEADERS["Set Realistic Headers<br/>User-Agent, Accept, Accept-Language"]
HEADERS --> STRATEGY{"Choose Strategy"}
STRATEGY --> FIXED["Fixed Delay<br/>time.sleep(2)<br/>Simple, predictable"]
STRATEGY --> RANDOM["Random Delay<br/>time.sleep(random.uniform(1, 3))<br/>Looks more natural"]
STRATEGY --> ADAPTIVE["Adaptive Delay<br/>Slow down on 429s<br/>Speed up on 200s"]
subgraph ANTI_BLOCK ["Anti-Blocking Techniques"]
ROTATE_UA["Rotate User-Agents<br/>Random browser string per request"]
SESSION_USE["Use requests.Session()<br/>Maintain cookies naturally"]
CACHE["Cache Responses<br/>Don't re-fetch unchanged pages"]
end
FIXED --> ANTI_BLOCK
RANDOM --> ANTI_BLOCK
ADAPTIVE --> ANTI_BLOCK
subgraph LIMITS ["Know Your Limits"]
RESPECT["Respect rate limit headers<br/>X-RateLimit-Remaining<br/>Retry-After"]
TOS["Check Terms of Service<br/>Some sites prohibit scraping"]
API_FIRST["Prefer APIs when available<br/>Structured data, no parsing needed"]
end
style START fill:#51cf66,stroke:#27ae60,color:#fff
style ANTI_BLOCK fill:#ffd43b,stroke:#f59f00,color:#000
style LIMITS fill:#ff922b,stroke:#e8590c,color:#fff
```
Key points:
- Always check robots.txt before scraping (it tells you which paths are off-limits)
- Random delays between 1-3 seconds look more natural than fixed intervals
- Rotate User-Agent strings to reduce the chance of being fingerprinted
- If the site has a public API, use it instead of scraping HTML
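A hedged sketch of these habits follows. The host `example.com`, the user-agent strings, and the 30-second fallback are illustrative; a real scraper would also cache the parsed robots.txt rather than re-reading it for every request. The robots.txt check uses the standard-library `urllib.robotparser`.

```python
import random
import time
from urllib import robotparser

import requests

# Illustrative browser strings; rotate to reduce fingerprinting.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36",
]

session = requests.Session()                         # keeps cookies across requests
session.headers.update({
    "Accept": "text/html,application/xhtml+xml",
    "Accept-Language": "en-US,en;q=0.9",
})


def allowed_by_robots(url, user_agent="*"):
    """Check robots.txt before fetching; respect Disallow rules."""
    rp = robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")     # illustrative host
    rp.read()
    return rp.can_fetch(user_agent, url)


def polite_get(url):
    if not allowed_by_robots(url):
        return None                                  # path is off-limits, skip it
    session.headers["User-Agent"] = random.choice(USER_AGENTS)  # rotate per request
    resp = session.get(url, timeout=10)
    if resp.status_code == 429:
        # Honor Retry-After when given in seconds, otherwise fall back to 30s.
        retry_after = resp.headers.get("Retry-After", "30")
        time.sleep(int(retry_after) if retry_after.isdigit() else 30)
        resp = session.get(url, timeout=10)
    time.sleep(random.uniform(1, 3))                 # random delay looks more natural
    return resp
```

The session keeps headers and cookies consistent across requests, which is both politer and closer to how a real browser behaves than constructing a fresh request each time.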
## Parse and Extract: BeautifulSoup Workflow
The parsing stage turns raw HTML into a navigable tree. Understanding selector strategies helps you write resilient scrapers that survive minor page layout changes.
```mermaid
flowchart TD
HTML["Raw HTML String<br/>response.text"] --> SOUP["BeautifulSoup(html, 'html.parser')<br/>Builds parse tree"]
SOUP --> SELECT_STRATEGY{"Selector Strategy"}
SELECT_STRATEGY --> CSS["CSS Selectors<br/>soup.select('div.product > h2 a')<br/>Familiar, powerful"]
SELECT_STRATEGY --> FIND["find / find_all<br/>soup.find_all('a', class_='product-link')<br/>Pythonic, flexible"]
SELECT_STRATEGY --> XPATH_ALT["Tag Navigation<br/>tag.parent / tag.next_sibling<br/>For tricky layouts"]
CSS --> DATA["Extract Data"]
FIND --> DATA
XPATH_ALT --> DATA
DATA --> TEXT["element.get_text(strip=True)<br/>Clean text content"]
DATA --> ATTR["element['href']<br/>element.get('src', '')<br/>Attribute values"]
DATA --> NESTED["element.select_one('.price')<br/>Nested element search"]
TEXT --> DICT["Build dict per item<br/>{'name': ..., 'price': ..., 'url': ...}"]
ATTR --> DICT
NESTED --> DICT
DICT --> LIST["Collect all items<br/>items.append(dict)"]
style SOUP fill:#cc5de8,stroke:#9c36b5,color:#fff
style CSS fill:#4a9eff,stroke:#2670c2,color:#fff
style FIND fill:#4a9eff,stroke:#2670c2,color:#fff
style DATA fill:#ffd43b,stroke:#f59f00,color:#000
style DICT fill:#51cf66,stroke:#27ae60,color:#fff
```
Key points:
- CSS selectors (select) are usually the cleanest approach for well-structured HTML
- get_text(strip=True) strips leading and trailing whitespace from text; get('attr', default) safely reads attributes that may be missing
- Build a dictionary per item, then collect into a list for easy CSV/JSON export
- Test selectors interactively in a Python shell before writing the full scraper
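A small self-contained example of this workflow is below. The HTML snippet and class names are made up to mirror the diagram, but the BeautifulSoup calls (`select`, `select_one`, `find`, `get_text(strip=True)`, `get`) are the standard API.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; the class names mirror the selectors used in the diagram.
html = """
<div class="product">
  <h2><a href="/widget">Widget</a></h2>
  <span class="price"> $19.99 </span>
  <img src="/img/widget.png">
</div>
"""

soup = BeautifulSoup(html, "html.parser")

items = []
for product in soup.select("div.product"):           # CSS selector strategy
    link = product.find("a")                          # find/find_all works equally well
    price_el = product.select_one(".price")           # nested element search
    img = product.find("img")
    items.append({
        "name": link.get_text(strip=True) if link else "",   # clean text content
        "url": link.get("href", "") if link else "",          # safe attribute access
        "price": price_el.get_text(strip=True) if price_el else "",
        "image": img.get("src", "") if img else "",
    })

print(items)
# [{'name': 'Widget', 'url': '/widget', 'price': '$19.99', 'image': '/img/widget.png'}]
```

Pasting a snippet like this into a Python shell is a quick way to test selectors interactively before wiring them into the full scraper.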
| Back to Diagram Index |
|---|