Level 0 / Project 10 - Duplicate Line Finder¶

Learn Your Way¶

Read	Build	Watch	Test	Review	Visualize	Try
Concept	This project	—	Quiz	Flashcards	Diagram	Browser

Estimated time: 25 minutes

Focus¶

set usage and duplicate detection

Why this project exists¶

Find repeated lines in a text file, report how many times each appears and on which line numbers. You will learn dictionary-based counting and set-based uniqueness checks -- the fundamental pattern for deduplication.

Run (copy/paste)¶

Use <repo-root> as the folder containing this repository's README.md.

cd <repo-root>/projects/level-0/10-duplicate-line-finder
python project.py --input data/sample_input.txt
pytest -q

Expected terminal output¶

=== Duplicate Line Report ===
  Total lines: 5
  Unique lines: 3
  Duplicated lines: 2

  Duplicates found:
    'check server status' appears 2 times (lines 1, 3)
    'restart web service' appears 2 times (lines 2, 5)

  Report written to data/output.json
4 passed

Expected artifacts¶

data/output.json
Passing tests
Updated notes.md

Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.

Alter it (required) — Extension¶

Add case-insensitive duplicate detection (so "Hello" and "hello" count as duplicates).
Add a --ignore-blank flag that skips empty lines when checking for duplicates.
Re-run script and tests.

Break it (required) — Core¶

Use a file with no duplicates at all -- does find_duplicates() return an empty dict?
Use a file where every line is identical -- does the report show the correct count?
Use a file with trailing spaces -- are "hello" and "hello " treated as duplicates?

Fix it (required) — Core¶

Ensure load_lines() strips trailing whitespace so "hello " matches "hello".
Handle the no-duplicates case by printing a clear "No duplicates found" message.
Add a test for the all-unique-lines case.

Checkpoint: All modifications done, tests still pass. Good time to review your changes.

Explain it (teach-back)¶

Why does count_line_occurrences() use a dict to count instead of nested loops?
What is the difference between dict.get(key, 0) and dict[key] when the key might not exist?
Why does find_duplicates() only return lines with count > 1?
Where would duplicate detection appear in real software (data deduplication, log analysis, CSV cleaning)?

Mastery check¶

You can move on when you can: - run baseline without docs, - explain one core function line-by-line, - break and recover in one session, - keep tests passing after your change.

← Prev	Home	Next →