Level 6 / Project 08 - Data Lineage Capture¶
Home: README
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | — |
Focus¶
- capture lineage metadata fields
Why this project exists¶
This project gives you level-appropriate practice in a realistic operations context. Goal: run the baseline, alter behavior, break one assumption, recover safely, and explain the fix.
Run (copy/paste)¶
Use <repo-root> as the folder containing this repository's README.md.
cd <repo-root>/projects/level-6/08-data-lineage-capture
python project.py --input data/sample_input.txt --output data/output_summary.json
pytest -q
Expected terminal output¶
{
"records_processed": 3,
"pipeline_steps": 3,
"total_lineage_entries": 9,
"lineage": {"order-101": [...], ...}
}
Expected artifacts¶
data/output_summary.json— lineage chains per record- Passing tests (
pytest -q→ 6+ passed) - Updated
notes.md
Alter it (required)¶
- Add a
filterstep betweennormalizeandpublishthat removes records below a threshold value. - Add a
format_lineage_report(key)function that prints a record's full journey in human-readable form. - Store a hash of the data at each step so you can verify no corruption occurred.
- Re-run script and tests after each change.
Break it (required)¶
- Process the same records twice and observe whether lineage entries are duplicated.
- Remove the
normalizestep and observe the gap in the lineage chain. - Pass a record with a missing
keyfield and trace what happens.
Fix it (required)¶
- Add deduplication: skip lineage recording if the same (record_key, step) already exists.
- Validate that every record has a
keyfield before processing. - Add a lineage integrity check that verifies parent_id chains are unbroken.
Explain it (teach-back)¶
- What is data lineage, and why do regulated industries require it?
- How does the parent_id column create a chain of custody for each record?
- What is a recursive CTE, and how could it trace a full lineage path?
- How would you implement lineage in a real data warehouse?
Mastery check¶
You can move on when you can: - run baseline without docs, - explain one core function line-by-line, - break and recover in one session, - keep tests passing after your change.
Related Concepts¶
| ← Prev | Home | Next → |
|---|---|---|