# Level 4 / Project 03 - Robust CSV Ingestor
## Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | Browser |
Estimated time: 50 minutes
## Focus
- malformed row handling and recovery
## Why this project exists
This project gives you level-appropriate practice in a realistic operations context. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.
## Run (copy/paste)

Use `<repo-root>` as the folder containing this repository's README.md.

```shell
cd <repo-root>/projects/level-4/03-robust-csv-ingestor
python project.py --input data/sample_input.csv --output-dir data/output
pytest -q
```
## Expected terminal output

## Expected artifacts

- `data/output/clean_data.csv` - rows that passed validation
- `data/output/quarantined_rows.csv` - bad rows with row numbers
- `data/output/ingestion_report.json` - summary with error details
- Passing tests
- Updated `notes.md`
## Design First

Before writing code, sketch your approach in `notes.md`:
- What functions or classes do you need?
- What data structures will you use?
- What's the flow from input to output?
- What could go wrong?
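One possible shape for the pipeline, as a minimal sketch. The function names, the expected column count, and the quarantine record format are assumptions for illustration, not the project's actual API:

```python
EXPECTED_COLUMNS = 3  # assumption: adjust to match sample_input.csv

def validate_row(row):
    """Return an error message for a bad row, or None if the row is acceptable."""
    if len(row) != EXPECTED_COLUMNS:
        return f"expected {EXPECTED_COLUMNS} columns, got {len(row)}"
    if all(field.strip() == "" for field in row):
        return "all fields empty"
    return None

def split_rows(rows):
    """Split rows into (clean, quarantined); each quarantined entry keeps its row number."""
    clean, quarantined = [], []
    for number, row in enumerate(rows, start=1):
        error = validate_row(row)
        if error is None:
            clean.append(row)
        else:
            quarantined.append([number, error, *row])
    return clean, quarantined
```

Keeping validation pure (row in, error-or-None out) makes each rule easy to unit-test in isolation, and the row number travels with the quarantined record so the source data can be fixed later.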
Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.
## Alter it (required) — Extension
- What early-stopping condition would be useful for very bad input files?
- Can you add a validation rule for a specific column's data type?
- Write a parametrized test for your new validation.
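As one way to approach the parametrized test, here is a sketch for a hypothetical rule that an "age" column must be a non-negative integer. The rule and the function name are illustrative, not part of the project's code:

```python
import pytest

def is_valid_age(value):
    """Hypothetical validation rule: the field must parse as a non-negative integer."""
    try:
        return int(value) >= 0
    except ValueError:
        return False

@pytest.mark.parametrize(
    "value, expected",
    [
        ("0", True),
        ("42", True),
        ("-1", False),
        ("", False),
        ("abc", False),
        ("3.5", False),  # int("3.5") raises ValueError, so this is rejected
    ],
)
def test_is_valid_age(value, expected):
    assert is_valid_age(value) == expected
```

Each tuple becomes its own test case, so a failure report names the exact input that broke the rule.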
## Break it (required) — Core
- What happens when the CSV structure itself is unusual (no headers, weird quoting)?
- Try embedding tricky characters inside fields — does the parser handle them?
- Find an edge case that confuses the row counting logic.
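For the quoting experiments: the standard `csv` module already handles quoted commas and embedded newlines, which is exactly what trips up row counting based on physical lines. A small demonstration:

```python
import csv
import io

# One field contains a comma, another contains an embedded newline.
# csv.reader parses these correctly; naive str.split(",") or line counting does not.
raw = 'id,comment\n1,"hello, world"\n2,"line one\nline two"\n'

rows = list(csv.reader(io.StringIO(raw)))
print(len(raw.splitlines()), "physical lines ->", len(rows), "logical rows")
# prints: 4 physical lines -> 3 logical rows
```

If your ingestor assigns row numbers by counting lines it reads, rather than by enumerating the reader's output, the quarantine row numbers will drift on files like this one.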
## Fix it (required) — Core
- Make the tool configurable for the structural issue you found.
- Ensure error reporting includes enough context to fix the source data.
- Re-run until all tests pass.
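One way to make a headerless file a configurable case rather than a crash, sketched with an assumed `has_header` flag (in the real tool you would wire this to a command-line option; the function and parameter names here are illustrative):

```python
import csv
import io

def read_rows(text, has_header=True):
    """Parse CSV text; the has_header flag makes headerless input an explicit, supported case."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header and rows:
        return rows[0], rows[1:]
    return None, rows
```

Returning the header separately also gives error reports something concrete to cite: a message like "row 7, column 'amount': not a number" points straight at the cell to fix in the source data.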
Checkpoint: All modifications done, tests still pass. Good time to review your changes.
## Explain it (teach-back)
- Why does the quarantine file prepend the original row number?
- What is the difference between "too few columns" and "all fields empty"?
- Why do we write the quarantine file even if it is empty?
- How would you adapt this pattern for streaming very large files (gigabytes)?
## Mastery check

You can move on when you can:

- run the baseline without docs,
- explain one core function line by line,
- break and recover in one session,
- keep tests passing after your change.
## Related Concepts
## Stuck? Ask AI
If you are stuck after trying for 20 minutes, use one of these prompts:
- "I am working on Robust CSV Ingestor. I got this error: [paste error]. Can you explain what this error means without giving me the fix?"
- "I am trying to handle malformed CSV rows without crashing. Can you explain strategies for recovering from bad rows in a data pipeline?"
- "Can you explain the difference between skipping bad rows, quarantining them, and fixing them automatically?"