Level 2 / Project 06 - Records Deduplicator
Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| Concept | This project | Walkthrough | Quiz | Flashcards | Diagram | Browser |
Estimated time: 35 minutes
Focus
- dedupe logic with stable keys
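A "stable key" here means that two records which should count as the same always produce the same key. The sketch below is one possible way to build such a key, not the project's actual code: the helper name `make_key` and the strip-plus-lowercase normalisation are assumptions made for illustration.

```python
def make_key(record, key_fields):
    """Build a hashable, normalised key from the chosen fields.

    Hypothetical helper: assumes every value is a string and that
    surrounding whitespace and letter case should be ignored.
    """
    return tuple(record[field].strip().lower() for field in key_fields)


first = {"name": "Ada", "email": "ada@example.com"}
second = {"name": "Ada", "email": " ADA@Example.com "}

# Both records normalise to the same tuple, so they count as duplicates.
print(make_key(first, ["name", "email"]))
print(make_key(first, ["email"]) == make_key(second, ["email"]))
```

A tuple is used because it is hashable, so the key can be stored in a set or used as a dictionary key.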
Why this project exists
This project gives you level-appropriate practice in a realistic operations context. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.
Run (copy/paste)
Use <repo-root> as the folder containing this repository's README.md.
cd <repo-root>/projects/level-2/06-records-deduplicator
python project.py data/sample_input.txt --keys name email
python project.py data/sample_input.txt --keys email --keep last
python project.py data/sample_input.txt --keys email --show-groups
pytest -q
Expected terminal output
Expected artifacts
- Dedup results on stdout
- Passing tests
- Updated `notes.md`
Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.
Alter it (required) — Extension
- Add a `--output` flag that writes unique records to a new CSV file.
- Add a `--case-sensitive` flag that disables lowercase normalisation.
- Add a count of how many times each duplicate key appeared.
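The duplicate count can be approached with `collections.Counter`. The following is a minimal sketch under assumptions: the function name `count_keys` and its `case_sensitive` parameter are hypothetical and may not match how `project.py` is organised.

```python
from collections import Counter


def count_keys(records, key_fields, case_sensitive=False):
    """Count how many records share each key (hypothetical helper)."""
    counts = Counter()
    for record in records:
        values = (record[field].strip() for field in key_fields)
        # Lowercase unless the caller asked for case-sensitive matching.
        key = tuple(v if case_sensitive else v.lower() for v in values)
        counts[key] += 1
    return counts


rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Ada L.", "email": "ADA@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
]

counts = count_keys(rows, ["email"])
# Keep only the keys that appeared more than once.
duplicates = {key: n for key, n in counts.items() if n > 1}
print(duplicates)
```

With `case_sensitive=True` the two Ada rows produce different keys, which is a quick way to check that the new flag really changes behaviour.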
Break it (required) — Core
- Use a key field that does not exist in the CSV — what happens?
- Feed a CSV where every row is identical — is the output correct?
- Use an empty file — does it crash?
Fix it (required) — Core
- Validate that key_fields exist in the headers before processing.
- Handle the all-duplicates case gracefully.
- Add a test for missing key fields raising a clear error.
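One way to approach the validation step is to check the header row before any record is processed. This is a sketch, not the project's implementation: `validate_key_fields` and `read_records` are hypothetical names, and reading from a string through `io.StringIO` is only there to keep the example self-contained.

```python
import csv
import io


def validate_key_fields(headers, key_fields):
    """Raise ValueError with a readable message if any key field is absent."""
    if not headers:
        # csv.DictReader reports fieldnames as None when the input is empty.
        raise ValueError("input has no header row (is the file empty?)")
    missing = [field for field in key_fields if field not in headers]
    if missing:
        raise ValueError(
            f"key field(s) not in CSV headers: {', '.join(missing)} "
            f"(available: {', '.join(headers)})"
        )


def read_records(text, key_fields):
    """Parse CSV text and check the key fields before returning any rows."""
    reader = csv.DictReader(io.StringIO(text))
    validate_key_fields(reader.fieldnames, key_fields)
    return list(reader)


sample = "name,email\nAda,ada@example.com\n"
print(read_records(sample, ["email"]))

try:
    read_records(sample, ["phone"])
except ValueError as exc:
    print(f"error: {exc}")
```

In a pytest test the same check could be written with `pytest.raises(ValueError)` around the failing call.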
Checkpoint: All modifications done, tests still pass. Good time to review your changes.
Explain it (teach-back)
- Why are sets used for tracking seen keys instead of lists?
- What is the time complexity of `set.add()` vs `list.append()`, and of membership lookups (`in`) on a set vs a list?
- How does the `keep="last"` mode differ in implementation from `keep="first"`?
- When would you use deduplication in a real data pipeline?
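To make the `keep` question concrete, here is a minimal sketch of one possible implementation. It assumes records are dictionaries of strings and that keys are lowercased; the function `dedupe` is hypothetical and the project's own code may be structured differently (for example with a set of seen keys).

```python
def dedupe(records, key_fields, keep="first"):
    """Return one record per key, keeping the first or last occurrence."""
    if keep not in ("first", "last"):
        raise ValueError(f"unknown keep mode: {keep!r}")
    chosen = {}  # key -> record; dicts preserve insertion order of keys
    for record in records:
        key = tuple(record[field].strip().lower() for field in key_fields)
        if keep == "first":
            # Only store the record if this key has not been seen yet.
            chosen.setdefault(key, record)
        else:
            # Overwrite, so the most recent record for the key wins.
            chosen[key] = record
    return list(chosen.values())


rows = [
    {"name": "Ada", "email": "ada@example.com"},
    {"name": "Bob", "email": "bob@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com"},
]

print([r["name"] for r in dedupe(rows, ["email"], keep="first")])  # ['Ada', 'Bob']
print([r["name"] for r in dedupe(rows, ["email"], keep="last")])   # ['Ada Lovelace', 'Bob']
```

In this sketch the output order follows the first appearance of each key in both modes; only the record stored for that key differs. Whether `keep="last"` should instead move the record to its later position is a design choice worth stating in your explanation.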
Mastery check
You can move on when you can:
- explain O(1) vs O(n) lookup time for sets vs lists,
- implement dedup with configurable key fields from memory,
- describe the difference between exact and fuzzy deduplication,
- add a new keep mode (e.g. "all") without breaking existing tests.
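The lookup-time difference in the first item can be observed directly with `timeit`. This is a rough sketch rather than a benchmark: the collection size and repeat count are arbitrary choices, and the absolute timings depend on your machine.

```python
import timeit

items = list(range(100_000))
as_list = items
as_set = set(items)
target = 99_999  # worst case for the list: the last element

# `in` on a list scans element by element (O(n));
# `in` on a set hashes the value and jumps to its slot (O(1) on average).
list_time = timeit.timeit(lambda: target in as_list, number=200)
set_time = timeit.timeit(lambda: target in as_set, number=200)

print(f"list: {list_time:.4f}s  set: {set_time:.6f}s")
```

Try changing `target` to `0` and compare again: the list becomes fast because the scan stops at the first element, which shows that O(n) describes the worst case, not every lookup.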
Related Concepts
- Collections Explained
- Files and Paths
- Functions Explained
- How Loops Work
- Quiz: Collections Explained
Stuck? Ask AI
If you are stuck after trying for 20 minutes, use one of these prompts:
- "I am working on Records Deduplicator. I got this error: [paste error]. Can you explain what this error means without giving me the fix?"
- "I am trying to use a set to track items I have already seen. Can you explain why sets are faster than lists for membership checks?"
- "Can you explain the difference between `keep='first'` and `keep='last'` deduplication strategies with a simple example?"