Level 5 / Project 08 - Cross File Joiner¶
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | Browser |
Quick Recall: This project uses dictionary lookups to match records across files. Before starting, make sure you can: build a dictionary from CSV rows using one column as the key, then look up values by that key (Level 1, Project 05 - CSV First Reader).
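As a refresher, the lookup pattern from the Quick Recall can be sketched as follows. This is a minimal illustration, not the project's code: the sample data and the `build_lookup` helper are assumptions made up for the example.

```python
import csv
import io

# Inline sample standing in for a departments file (hypothetical contents).
SAMPLE_CSV = """dept_id,dept_name
10,Engineering
20,Sales
"""

def build_lookup(csv_text, key):
    """Return {key value: row dict} for every row in the CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}

lookup = build_lookup(SAMPLE_CSV, "dept_id")
print(lookup["10"]["dept_name"])  # Engineering
```

Note that `csv.DictReader` yields every value as a string, so the key here is `"10"`, not `10`.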
Estimated time: 75 minutes
Focus¶
- join records across source files
Why this project exists¶
This project gives you practice joining records from two CSV files on a shared key, a routine task in operations work. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.
Run (copy/paste)¶
Use <repo-root> as the folder containing this repository's README.md.
cd <repo-root>/projects/level-5/08-cross-file-joiner
python project.py --left data/employees.csv --right data/departments.csv --key dept_id --join inner --output data/joined.json
pytest -q
Expected terminal output¶
Expected artifacts¶
- data/joined.json
- Passing tests
- Updated notes.md
Checkpoint: Baseline code runs and all tests pass. Commit your work before continuing.
Alter it (required) — Extension¶
- Add a --join full mode that includes unmatched rows from both sides with null fills.
- Add column selection: --select name,dept_name to keep only specific fields in output.
- Print a summary of matched, left-only, and right-only counts.
- Re-run script and tests.
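One possible shape for the extension is sketched below. The function names (`full_join`, `select_fields`), the sample rows, and the count labels are assumptions for illustration; the real `project.py` may be organized differently.

```python
def full_join(left_rows, right_rows, key):
    """Full outer join on `key`; returns (joined rows, summary counts)."""
    left_fields = list(left_rows[0]) if left_rows else []
    right_fields = list(right_rows[0]) if right_rows else []
    right_index = {row[key]: row for row in right_rows}  # last row wins on duplicates
    matched_keys = set()
    joined = []
    counts = {"matched": 0, "left_only": 0, "right_only": 0}

    for row in left_rows:
        match = right_index.get(row[key])
        if match is None:
            # No partner on the right: fill right-side fields with None (JSON null).
            joined.append({**dict.fromkeys(right_fields), **row})
            counts["left_only"] += 1
        else:
            joined.append({**row, **match})
            matched_keys.add(row[key])
            counts["matched"] += 1

    # Right rows the left loop never matched still belong in a full join.
    for value, row in right_index.items():
        if value not in matched_keys:
            joined.append({**dict.fromkeys(left_fields), **row})
            counts["right_only"] += 1
    return joined, counts

def select_fields(rows, fields):
    """Keep only the requested fields, in the requested order."""
    return [{field: row.get(field) for field in fields} for row in rows]

employees = [{"name": "Ada", "dept_id": "10"}, {"name": "Grace", "dept_id": "30"}]
departments = [{"dept_id": "10", "dept_name": "Engineering"},
               {"dept_id": "20", "dept_name": "Sales"}]

joined, counts = full_join(employees, departments, "dept_id")
print(counts)  # {'matched': 1, 'left_only': 1, 'right_only': 1}
print(select_fields(joined, ["name", "dept_name"]))
```

Python's `None` serializes to `null` when the rows are written with `json.dump`, which is what gives the null fills in the output file.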
Break it (required) — Core¶
- Use a --key that exists in only one of the two files.
- Use files with duplicate keys and observe which row wins.
- Capture the first failing test or visible bad output.
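To see why duplicate keys matter, the small experiment below compares two common ways of building a lookup. The sample rows are invented, and which behavior the baseline actually exhibits depends on how `project.py` builds its index, so treat this as a way to form a hypothesis before you run the real script.

```python
from collections import Counter

# Two rows share dept_id "10" (hypothetical data).
rows = [
    {"dept_id": "10", "dept_name": "Engineering"},
    {"dept_id": "10", "dept_name": "Platform"},
]

# A dict comprehension overwrites earlier entries: last row wins.
last_wins = {row["dept_id"]: row for row in rows}

# setdefault only stores a key the first time it is seen: first row wins.
first_wins = {}
for row in rows:
    first_wins.setdefault(row["dept_id"], row)

# Counting key values reveals which keys are duplicated.
key_counts = Counter(row["dept_id"] for row in rows)
duplicates = [value for value, n in key_counts.items() if n > 1]

print(last_wins["10"]["dept_name"])   # Platform
print(first_wins["10"]["dept_name"])  # Engineering
print(duplicates)                     # ['10']
```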
Fix it (required) — Core¶
- Validate that the join key exists in both files before joining.
- Document or handle the duplicate-key behavior explicitly (first-wins or last-wins).
- Add tests for missing keys and duplicates.
- Re-run until output and tests are deterministic.
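The key check can be sketched as a small guard that runs before any joining starts. The `validate_join_key` name and the error wording are assumptions, not the project's API; the field lists could come from `csv.DictReader(...).fieldnames` for each file.

```python
def validate_join_key(left_fields, right_fields, key):
    """Raise ValueError naming every side whose header lacks the join key."""
    sides = (("left", left_fields), ("right", right_fields))
    missing = [name for name, fields in sides if key not in fields]
    if missing:
        raise ValueError(f"join key {key!r} missing from: {', '.join(missing)}")

def test_missing_key_is_rejected():
    # pytest collects functions named test_*; plain asserts are enough here.
    try:
        validate_join_key(["name", "dept_id"], ["id", "dept_name"], "dept_id")
    except ValueError as exc:
        assert "right" in str(exc)
    else:
        raise AssertionError("expected ValueError for a missing join key")

def test_present_key_is_accepted():
    assert validate_join_key(["name", "dept_id"], ["dept_id", "dept_name"], "dept_id") is None

test_missing_key_is_rejected()
test_present_key_is_accepted()
```

Failing early with a message that names the offending file is usually easier to debug than a `KeyError` raised partway through the join.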
Checkpoint: All modifications done, tests still pass. Good time to review your changes.
Explain it (teach-back)¶
- What is the difference between inner, left, and full outer joins?
- How does index_by_key build a lookup dictionary for fast matching?
- Why does the full join need to track "already matched" right-side keys?
- Where do you see cross-file joining in data pipelines (SQL JOINs, pandas merge)?
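If it helps your teach-back, the sketch below contrasts inner and left joins on the same data. The `index_rows` and `join` helpers are hypothetical stand-ins written for this example; compare them with the project's own index_by_key rather than assuming they match.

```python
def index_rows(rows, key):
    """Map each key value to its row (last row wins on duplicate keys)."""
    return {row[key]: row for row in rows}

def join(left_rows, right_rows, key, how="inner"):
    """Join left rows to right rows on `key`; `how` is 'inner' or 'left'."""
    right_index = index_rows(right_rows, key)
    right_fields = list(right_rows[0]) if right_rows else []
    joined = []
    for row in left_rows:
        # Dictionary lookup avoids rescanning right_rows for every left row.
        match = right_index.get(row[key])
        if match is not None:
            joined.append({**row, **match})
        elif how == "left":
            # Left join keeps the unmatched left row, with None for right fields.
            joined.append({**dict.fromkeys(right_fields), **row})
        # Inner join drops the unmatched left row.
    return joined

employees = [{"name": "Ada", "dept_id": "10"}, {"name": "Grace", "dept_id": "30"}]
departments = [{"dept_id": "10", "dept_name": "Engineering"},
               {"dept_id": "20", "dept_name": "Sales"}]

inner = join(employees, departments, "dept_id", how="inner")
left = join(employees, departments, "dept_id", how="left")
print(len(inner), len(left))  # 1 2
print(left[1]["dept_name"])   # None
```

Neither mode ever emits the Sales row, because the loop only walks the left side. A full join has to make a second pass over the right side, which is where tracking the keys that were already matched comes in.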
Mastery check¶
You can move on when you can:
- run baseline without docs,
- explain one core function line-by-line,
- break and recover in one session,
- keep tests passing after your change.
Related Concepts¶