Level 2 / Project 03 - Data Cleaning Pipeline¶
Try in Browser: Run this exercise online — no installation needed!
Before You Start¶
Recall these prerequisites before diving in:
- Can you use a set to track items you have already seen? (seen = set())
- Can you chain string methods? (text.strip().lower().replace(" ", ""))
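As a quick refresher, the two idioms above combine naturally in a tiny dedupe loop (a sketch, not the project's actual code):

```python
# Chain string methods to normalize, then use a set to deduplicate.
records = ["  Alice ", "alice", "Bob  "]

seen = set()   # tracks normalized values we have already kept
cleaned = []
for text in records:
    key = text.strip().lower()   # chained string methods
    if key not in seen:          # O(1) membership check
        seen.add(key)
        cleaned.append(key)

print(cleaned)  # ['alice', 'bob']
```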
Estimated time: 30 minutes
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| Concept | This project | — | Quiz | Flashcards | Diagram | Browser |
Focus¶
- standardize text and required fields
Why this project exists¶
This project gives you level-appropriate practice in a realistic operations context. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.
Run (copy/paste)¶
Use <repo-root> as the folder containing this repository's README.md.
cd <repo-root>/projects/level-2/03-data-cleaning-pipeline
python project.py data/sample_input.txt
python project.py data/sample_input.txt --filter "@.*\."
pytest -q
Expected terminal output¶
Expected artifacts¶
- Cleaning stats and cleaned records on stdout
- Passing tests
- Updated `notes.md`
Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.
Alter it (required) — Extension¶
- Add a cleaning step that normalizes phone numbers (strip dashes/parens).
- Write rejected records to a separate `quarantine.txt` with reasons.
- Add a `--dry-run` flag that reports stats without writing output.
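For the phone-number step, a minimal sketch of the idea (the helper name `normalize_phone` is hypothetical; fit it into the project's own step structure):

```python
import re

def normalize_phone(raw: str) -> str:
    """Strip dashes, parentheses, dots, and spaces from a phone number.

    Hypothetical helper for illustration only.
    """
    return re.sub(r"[()\-.\s]", "", raw)

print(normalize_phone("(555) 123-4567"))  # 5551234567
```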
Break it (required) — Core¶
- Feed a file where every line is whitespace — does it crash or return empty?
- Use a bad regex pattern in `--filter` (e.g. `[unclosed`) — what happens?
- Feed records with mixed encodings — does `normalise_case` break?
Fix it (required) — Core¶
- Wrap `re.compile` in a try/except for invalid regex patterns.
- Add a test for all-blank input files.
- Handle encoding errors gracefully with a try/except `UnicodeDecodeError`.
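The first and third fixes might look like this (a sketch under assumptions — the real CLI may report errors differently, and `compile_filter`/`read_lines` are illustrative names, not the project's):

```python
import re

def compile_filter(pattern):
    """Return a compiled regex, or None if the pattern is invalid."""
    try:
        return re.compile(pattern)
    except re.error as exc:          # raised for patterns like "[unclosed"
        print(f"Invalid --filter pattern: {exc}")
        return None

def read_lines(path):
    """Read a file as UTF-8, substituting bad bytes instead of crashing."""
    try:
        with open(path, encoding="utf-8") as f:
            return f.read().splitlines()
    except UnicodeDecodeError:
        # Retry, replacing undecodable bytes with U+FFFD.
        with open(path, encoding="utf-8", errors="replace") as f:
            return f.read().splitlines()

print(compile_filter("[unclosed"))  # None, with a message
```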
Checkpoint: All modifications done, tests still pass. Good time to review your changes.
Explain it (teach-back)¶
- Why does the pipeline run steps in a specific order (strip before dedupe)?
- How does using a set for deduplication achieve O(1) lookup?
- What is the difference between `re.search` and `re.match`?
- Where would you use a data cleaning pipeline in a real data workflow?
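A two-line experiment makes the `re.search` / `re.match` distinction concrete:

```python
import re

# re.match anchors at the start of the string; re.search scans anywhere.
text = "id: 42"

print(re.match(r"\d+", text))   # None — the string does not start with digits
print(re.search(r"\d+", text))  # a Match object for "42"
```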
Mastery check¶
You can move on when you can:
- explain why pipeline step ordering matters,
- describe how sets enable fast deduplication,
- add a new cleaning step without modifying existing ones,
- explain regular expressions used in filter_by_pattern.
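On the third point, one common way to add steps without touching existing ones is to treat the pipeline as a list of functions (a sketch only — the project's own structure may differ):

```python
def strip_whitespace(records):
    return [r.strip() for r in records]

def drop_empty(records):
    return [r for r in records if r]

# Append new steps here; existing steps stay untouched.
PIPELINE = [strip_whitespace, drop_empty]

def run(records, steps=PIPELINE):
    for step in steps:
        records = step(records)
    return records

print(run(["  a ", "   ", "b"]))  # ['a', 'b']
```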
Related Concepts¶
- Collections Explained
- Functions Explained
- How Loops Work
- Virtual Environments
- Quiz: Collections Explained
Stuck? Ask AI¶
If you are stuck after trying for 20 minutes, use one of these prompts:
- "I am working on Data Cleaning Pipeline. I got this error: [paste error]. Can you explain what this error means without giving me the fix?"
- "I am trying to run cleaning steps in a specific order. Can you explain why the order matters when you strip whitespace before deduplicating?"
- "Can you explain `re.search` vs `re.match` with a simple example?"