Level 4 / Project 07 - Duplicate Record Investigator
Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | Browser |
Estimated time: 60 minutes
Focus
- collision analysis and root cause
Why this project exists
This project gives you Level 4 practice investigating duplicate records: finding exact and fuzzy matches in a CSV and tracing why they collide. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.
Run (copy/paste)
Use <repo-root> as the folder containing this repository's README.md.
cd <repo-root>/projects/level-4/07-duplicate-record-investigator
python project.py --input data/sample_input.csv --output data/duplicates_report.json --keys name,email --threshold 0.8
pytest -q
Expected terminal output
Expected artifacts
- data/duplicates_report.json: exact and fuzzy duplicate pairs
- Passing tests
- Updated notes.md
Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.
Alter it (required) - Extension
- Add a --method flag supporting both bigram (current) and levenshtein similarity.
- Add a --group mode that clusters duplicates into groups instead of listing pairs.
- Re-run the script and tests, then add a parametrized test for the new method.
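One possible shape for the two extensions is sketched here. It is not the project's actual code: the function names `levenshtein_distance`, `levenshtein_similarity`, and `group_pairs` are hypothetical, and it assumes duplicate pairs are represented as tuples of record indexes. The similarity is normalised to the 0.0-1.0 range so that it can share the existing --threshold scale, and grouping uses union-find so that pairs (1, 2) and (2, 3) end up in one cluster.

```python
from collections import defaultdict


def levenshtein_distance(a: str, b: str) -> int:
    """Edit distance computed with the two-row dynamic programming table."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Map edit distance onto 0.0-1.0 (1.0 means identical strings)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty strings count as identical
    return 1.0 - levenshtein_distance(a, b) / longest


def group_pairs(pairs):
    """Cluster duplicate pairs into sorted groups using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(list)
    for item in list(parent):
        groups[find(item)].append(item)
    return sorted(sorted(members) for members in groups.values())
```

A parametrized test for the new method could then feed the same string pairs to both similarity functions and assert that each result stays within 0.0 and 1.0.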
Break it (required) - Core
- Use a very low threshold (0.1) and observe how many false positives appear.
- Feed it a CSV with only one row — verify no crash on the single-record case.
- Use key fields that do not exist in the CSV and observe the behavior.
Fix it (required) - Core
- Validate that key fields exist in the CSV headers before comparing.
- Add a warning when the threshold produces more than 50% of records as duplicates.
- Re-run until all tests pass.
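The two fixes above might look roughly like the following. This is a sketch under assumptions, not the project's implementation: `validate_keys` and `warn_if_too_many_duplicates` are hypothetical helper names, the input is assumed to be read with `csv.DictReader`, and duplicate pairs are assumed to be tuples of record indexes.

```python
import csv
import io
import warnings


def validate_keys(fieldnames, keys):
    """Raise ValueError if any requested key is not a CSV header."""
    headers = list(fieldnames or [])
    missing = [key for key in keys if key not in headers]
    if missing:
        raise ValueError(
            f"Key field(s) not found in CSV headers: {', '.join(missing)}. "
            f"Available headers: {', '.join(headers)}"
        )


def warn_if_too_many_duplicates(pairs, total_records, limit=0.5):
    """Warn when more than `limit` of all records appear in some pair."""
    if total_records == 0:
        return 0.0
    flagged = {index for pair in pairs for index in pair}
    ratio = len(flagged) / total_records
    if ratio > limit:
        warnings.warn(
            f"{ratio:.0%} of records were flagged as duplicates; "
            "the threshold may be too low.",
            stacklevel=2,
        )
    return ratio


# Usage with an in-memory CSV standing in for data/sample_input.csv.
sample = io.StringIO("name,email\nAda,ada@example.com\nAda L,ada@example.com\n")
reader = csv.DictReader(sample)
validate_keys(reader.fieldnames, ["name", "email"])  # passes silently
rows = list(reader)
```

Validating before the comparison loop means a mistyped key fails once with a clear message instead of surfacing as a `KeyError` or as silently empty comparisons later on.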
Checkpoint: All modifications done, tests still pass. Good time to review your changes.
Explain it (teach-back)
- What are character bigrams and why are they useful for fuzzy matching?
- Why does Jaccard similarity use set intersection/union instead of comparing characters directly?
- What is the time complexity of the nested-loop comparison, and how would you optimize it?
- When would fuzzy matching produce false positives, and how would you handle them?
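To ground the questions above, here is a minimal sketch of bigram-based Jaccard similarity and an all-pairs comparison. The names `bigrams`, `jaccard`, and `find_fuzzy_pairs` are hypothetical, and the normalisation (strip and lowercase) is an assumption; the project's own code may tokenise differently.

```python
from itertools import combinations


def bigrams(text: str) -> set:
    """Return the set of adjacent character pairs in a normalised string."""
    cleaned = text.strip().lower()
    return {cleaned[i:i + 2] for i in range(len(cleaned) - 1)}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity of the two bigram sets: |A & B| / |A | B|."""
    set_a, set_b = bigrams(a), bigrams(b)
    if not set_a and not set_b:
        # Strings shorter than two characters have no bigrams;
        # fall back to an exact comparison (an assumption of this sketch).
        return 1.0 if a.strip().lower() == b.strip().lower() else 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def find_fuzzy_pairs(values, threshold):
    """Compare every pair once: n * (n - 1) / 2 comparisons, so O(n^2)."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(values), 2)
        if jaccard(a, b) >= threshold
    ]


values = ["night", "nacht", "night "]
strict = find_fuzzy_pairs(values, 0.8)  # only the near-identical pair
loose = find_fuzzy_pairs(values, 0.1)   # every pair, including false positives
```

In this sketch "night" and "nacht" share only the bigram "ht" out of seven distinct bigrams, so they score about 0.14: below a 0.8 threshold but above 0.1, which is how a very low threshold lets false positives through. A common way to reduce the quadratic cost is blocking, where records are first bucketed by a cheap key (for example, the first letter or the email domain) and only records in the same bucket are compared.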
Mastery check
You can move on when you can:
- run baseline without docs,
- explain one core function line-by-line,
- break and recover in one session,
- keep tests passing after your change.