
Level 4 / Project 03 - Robust CSV Ingestor


Estimated time: 50 minutes

Focus

  • malformed row handling and recovery

Why this project exists

This project gives you level-appropriate practice in a realistic operations context. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.

Run (copy/paste)

Use <repo-root> as the folder containing this repository's README.md.

cd <repo-root>/projects/level-4/03-robust-csv-ingestor
python project.py --input data/sample_input.csv --output-dir data/output
pytest -q

Expected terminal output

{
  "total_rows": 8,
  "good": 5,
  "quarantined": 3,
  "errors": [ ... ]
}
5 passed

Expected artifacts

  • data/output/clean_data.csv — rows that passed validation
  • data/output/quarantined_rows.csv — bad rows with row numbers
  • data/output/ingestion_report.json — summary with error details (a minimal sketch of how all three files are produced follows this list)
  • Passing tests
  • Updated notes.md
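
To make the quarantine pattern concrete before you open project.py, here is a minimal, hypothetical sketch. The file names and report keys mirror the artifacts above, but the column count, function name, and error schema are placeholders; the real project will differ.

import csv
import json
from pathlib import Path

EXPECTED_COLUMNS = 3  # hypothetical; the real schema lives in project.py

def ingest(input_path, output_dir):
    """Split rows into a clean file and a quarantine file, plus a JSON report."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    report = {"total_rows": 0, "good": 0, "quarantined": 0, "errors": []}

    with open(input_path, newline="") as src, \
         open(out / "clean_data.csv", "w", newline="") as clean, \
         open(out / "quarantined_rows.csv", "w", newline="") as quar:
        reader = csv.reader(src)
        good_rows = csv.writer(clean)
        bad_rows = csv.writer(quar)
        header = next(reader)
        good_rows.writerow(header)
        bad_rows.writerow(["source_row"] + header)  # row number column comes first

        for row_number, row in enumerate(reader, start=2):  # header was line 1
            report["total_rows"] += 1
            if len(row) == EXPECTED_COLUMNS:
                report["good"] += 1
                good_rows.writerow(row)
            else:
                report["quarantined"] += 1
                report["errors"].append({"row": row_number, "reason": "wrong column count"})
                bad_rows.writerow([row_number] + row)

    (out / "ingestion_report.json").write_text(json.dumps(report, indent=2))
    return report

Notice that both output files are opened, and therefore created, whether or not any row fails; that detail comes back in the teach-back questions below.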

Design First

Before writing code, sketch your approach in notes.md:

  • What functions or classes do you need?
  • What data structures will you use?
  • What's the flow from input to output?
  • What could go wrong?


Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.

Alter it (required) — Extension

  1. What early-stopping condition would be useful for very bad input files?
  2. Can you add a validation rule for a specific column's data type?
  3. Write a parametrized test for your new validation (a sketch follows this list).
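
For item 3, a hedged starting point is below. validate_age is a hypothetical helper assumed to return a bool; substitute whatever validation function you actually added to project.py, along with cases that match its rule.

import pytest

from project import validate_age  # hypothetical; use your real function


@pytest.mark.parametrize(
    ("value", "expected"),
    [
        ("42", True),      # plain integer string is valid
        ("0", True),       # boundary case: zero allowed
        ("-3", False),     # negatives rejected
        ("forty", False),  # non-numeric rejected
        ("", False),       # empty field rejected
    ],
)
def test_validate_age(value, expected):
    assert validate_age(value) == expected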

Break it (required) — Core

  1. What happens when the CSV structure itself is unusual (no headers, weird quoting)?
  2. Try embedding tricky characters inside fields — does the parser handle them?
  3. Find an edge case that confuses the row counting logic (the sketch after this list probes one candidate).
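
One way to probe items 2 and 3 is to round-trip deliberately awkward fields through Python's csv module and watch the counters diverge. This uses only the standard library; the rows themselves are invented.

import csv
import io

tricky = [
    ["id", "name", "note"],
    ["1", 'O"Malley', "contains, a comma"],   # quote and comma inside fields
    ["2", "two\nlines", "embedded newline"],  # one logical row, two physical lines
]

buf = io.StringIO()
csv.writer(buf).writerows(tricky)
buf.seek(0)

reader = csv.reader(buf)
for logical_row, row in enumerate(reader, start=1):
    # reader.line_num counts physical lines, so the embedded newline
    # makes it run ahead of the logical row count
    print(logical_row, reader.line_num, row)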

Fix it (required) — Core

  1. Make the tool configurable for the structural issue you found (see the CLI sketch after this list).
  2. Ensure error reporting includes enough context to fix the source data.
  3. Re-run until all tests pass.
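
A common way to make structural assumptions configurable is to lift them into CLI flags. The flags below are illustrative, not part of the baseline project; adapt them to whatever issue you actually hit.

import argparse
import csv

parser = argparse.ArgumentParser()
parser.add_argument("--input", required=True)
parser.add_argument("--delimiter", default=",", help="field separator")
parser.add_argument("--no-header", action="store_true",
                    help="treat line 1 as data, not column names")
args = parser.parse_args()

with open(args.input, newline="") as f:
    reader = csv.reader(f, delimiter=args.delimiter)
    header = None if args.no_header else next(reader)
    for row in reader:
        pass  # validate / quarantine as in the baseline

Whatever flags you add, have the ingestion report record which settings were used, so quarantined rows can be traced back to the exact parse configuration.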

Checkpoint: All modifications done, tests still pass. Good time to review your changes.

Explain it (teach-back)

  1. Why does the quarantine file prepend the original row number?
  2. What is the difference between "too few columns" and "all fields empty"?
  3. Why do we write the quarantine file even if it is empty?
  4. How would you adapt this pattern for streaming very large files (gigabytes)? (A sketch of one approach follows this list.)
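
On question 4: csv.reader already yields one row at a time, so the main risk is accumulating rows or per-row error details in memory. A hedged sketch of a streaming variant, with illustrative names:

import csv

def stream_ingest(input_path, clean_path, quarantine_path, is_valid):
    """Process rows one at a time; only counters stay in memory."""
    counts = {"total": 0, "good": 0, "quarantined": 0}
    with open(input_path, newline="") as src, \
         open(clean_path, "w", newline="") as clean, \
         open(quarantine_path, "w", newline="") as quar:
        reader = csv.reader(src)
        good = csv.writer(clean)
        bad = csv.writer(quar)
        for row_number, row in enumerate(reader, start=1):
            counts["total"] += 1
            if is_valid(row):
                counts["good"] += 1
                good.writerow(row)
            else:
                counts["quarantined"] += 1
                bad.writerow([row_number] + row)
    return counts

For gigabyte inputs you would also bound the report's errors list, or stream error details to a log file, rather than letting it grow without limit.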

Mastery check

You can move on when you can:

  • run baseline without docs,
  • explain one core function line-by-line,
  • break and recover in one session,
  • keep tests passing after your change.



Stuck? Ask AI

If you are stuck after trying for 20 minutes, use one of these prompts:

  • "I am working on Robust CSV Ingestor. I got this error: [paste error]. Can you explain what this error means without giving me the fix?"
  • "I am trying to handle malformed CSV rows without crashing. Can you explain strategies for recovering from bad rows in a data pipeline?"
  • "Can you explain the difference between skipping bad rows, quarantining them, and fixing them automatically?"
