Level 10 / Project 06 - Resilience Chaos Workbench¶
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| Concept | This project | — | Quiz | Flashcards | — | — |
Focus¶
- Strategy pattern for injectable fault types
- Service state modeling with health simulation
- Resilience scoring and grading system
- Experiment rollback and recovery measurement
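The first two focus points can be sketched as follows. This is a minimal illustration, not the project's actual code: the fields on `ServiceState`, the health rule, the `LatencyInjection` and `ErrorInjection` classes, and the `run_experiment` function are all assumptions made for the example. Only the names `ChaosAction`, `ServiceState`, `apply`, and `rollback` come from this page.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class ServiceState:
    """Toy service model; every field and threshold here is an assumption."""
    latency_ms: float = 50.0
    error_rate: float = 0.0

    def is_healthy(self) -> bool:
        # Assumed health rule: responses are fast enough and mostly succeed.
        return self.latency_ms < 500.0 and self.error_rate < 0.5


class ChaosAction(ABC):
    """Strategy interface: each fault knows how to apply and undo itself."""

    @abstractmethod
    def apply(self, state: ServiceState) -> None: ...

    @abstractmethod
    def rollback(self, state: ServiceState) -> None: ...


class LatencyInjection(ChaosAction):
    """Hypothetical fault: adds a fixed delay to the service."""

    def __init__(self, extra_ms: float) -> None:
        self.extra_ms = extra_ms

    def apply(self, state: ServiceState) -> None:
        state.latency_ms += self.extra_ms

    def rollback(self, state: ServiceState) -> None:
        state.latency_ms -= self.extra_ms


class ErrorInjection(ChaosAction):
    """Hypothetical fault: forces the error rate to a given value."""

    def __init__(self, rate: float) -> None:
        self.rate = rate
        self._previous = 0.0

    def apply(self, state: ServiceState) -> None:
        self._previous = state.error_rate  # remember the value to restore
        state.error_rate = self.rate

    def rollback(self, state: ServiceState) -> None:
        state.error_rate = self._previous


def run_experiment(state: ServiceState, action: ChaosAction) -> dict:
    """The runner depends only on the ChaosAction interface."""
    action.apply(state)
    healthy_during = state.is_healthy()
    action.rollback(state)
    return {
        "fault": type(action).__name__,
        "healthy_during": healthy_during,
        "recovered": state.is_healthy(),
    }


if __name__ == "__main__":
    service = ServiceState()
    for fault in (LatencyInjection(1000.0), ErrorInjection(1.0)):
        print(run_experiment(service, fault))
```

Because `run_experiment` only calls `apply` and `rollback`, a new fault type is a new subclass and the runner stays unchanged.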
Why this project exists¶
Netflix's Chaos Monkey popularized the idea that systems should be deliberately tested against failure. This framework lets you define chaos experiments, inject faults into a service model, measure impact, and quantify resilience with letter grades, turning "we think it's reliable" into a measurable score.
Run (copy/paste)¶
Expected terminal output¶
Alter it (required)¶
- Add a `NetworkPartition` chaos action that removes all dependencies at once; implement `apply` and `rollback`.
- Add a `CombinedFault` that applies multiple actions simultaneously (e.g., latency + errors).
- Add a "blast radius" metric to the scorecard based on impact severity.
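One possible shape for the first two tasks is sketched below. It is a starting point under stated assumptions, not the reference solution: the `dependencies` and `latency_ms` fields and the `AddLatency` helper are hypothetical, and the real `ServiceState` in this project may look different.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceState:
    """Stand-in service model; both fields are assumptions."""
    dependencies: set = field(default_factory=set)
    latency_ms: float = 50.0


class NetworkPartition:
    """Removes every dependency at once and restores them on rollback."""

    def __init__(self) -> None:
        self._saved = None

    def apply(self, state: ServiceState) -> None:
        self._saved = set(state.dependencies)  # snapshot before cutting
        state.dependencies.clear()

    def rollback(self, state: ServiceState) -> None:
        if self._saved is not None:
            state.dependencies = set(self._saved)
            self._saved = None


class AddLatency:
    """Hypothetical second fault, used only to demonstrate composition."""

    def __init__(self, extra_ms: float) -> None:
        self.extra_ms = extra_ms

    def apply(self, state: ServiceState) -> None:
        state.latency_ms += self.extra_ms

    def rollback(self, state: ServiceState) -> None:
        state.latency_ms -= self.extra_ms


class CombinedFault:
    """Applies several actions together and undoes them in reverse order."""

    def __init__(self, *actions) -> None:
        self.actions = list(actions)

    def apply(self, state: ServiceState) -> None:
        for action in self.actions:
            action.apply(state)

    def rollback(self, state: ServiceState) -> None:
        # Reverse order, so the last fault applied is the first one undone.
        for action in reversed(self.actions):
            action.rollback(state)


if __name__ == "__main__":
    service = ServiceState(dependencies={"db", "cache"})
    fault = CombinedFault(NetworkPartition(), AddLatency(200.0))
    fault.apply(service)
    print(service)
    fault.rollback(service)
    print(service)
```

Rolling back in reverse order is a design choice that matters once actions interact, for example when one fault snapshots state that another fault has already modified.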
Break it (required)¶
- Set `MemoryPressure` to 100% and observe the service becoming unhealthy.
- Inject 100% error rate and verify the service fails all requests.
- Kill all dependencies and check how the health check responds.
Fix it (required)¶
- Add circuit-breaker logic to `ServiceState.handle_request` that stops accepting requests when error rate exceeds a threshold.
- Add a `graceful_degradation` mode where the service returns cached responses when dependencies are down.
- Test that degraded mode still counts as "recovered" in the scorecard.
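The circuit-breaker task could start from a sketch like the one below. Everything beyond the name `ServiceState.handle_request` is an assumption: the sliding window, the `fails` argument that simulates an injected fault, and the `CircuitOpenError` exception are invented for illustration. The sketch also never closes the circuit again; a fuller breaker would add a cool-down or half-open state.

```python
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker refuses a request (hypothetical name)."""


class ServiceState:
    def __init__(self, error_threshold: float = 0.5, window: int = 10) -> None:
        self.error_threshold = error_threshold
        # Sliding window of recent outcomes; True marks a failed request.
        self.recent = deque(maxlen=window)

    def error_rate(self) -> float:
        if not self.recent:
            return 0.0
        return sum(self.recent) / len(self.recent)

    def circuit_open(self) -> bool:
        # Trip only once the window is full, so one early failure
        # does not open the circuit on its own.
        window_full = len(self.recent) == self.recent.maxlen
        return window_full and self.error_rate() > self.error_threshold

    def handle_request(self, fails: bool = False) -> str:
        """`fails` stands in for whatever fault is currently injected."""
        if self.circuit_open():
            raise CircuitOpenError("error rate above threshold")
        self.recent.append(fails)
        return "error" if fails else "ok"


if __name__ == "__main__":
    service = ServiceState(error_threshold=0.5, window=4)
    for _ in range(4):
        service.handle_request(fails=True)
    try:
        service.handle_request()
    except CircuitOpenError as exc:
        print("rejected:", exc)
```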
Explain it (teach-back)¶
- How does the Strategy pattern make it easy to add new fault types without modifying the experiment runner?
- Why is rollback essential in chaos engineering, and what happens if you don't roll back?
- How does the grading system translate recovery rates into actionable categories?
- How would you adapt this to test real distributed systems instead of a simulation?
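For the grading question, it may help to see one way recovery rates could map to letter grades. The cutoffs, the `grade` and `scorecard` functions, and the input shape (fault name mapped to a list of per-run recovery flags) are all assumptions for this sketch; the project's real thresholds are not stated on this page.

```python
# Assumed cutoffs, checked from best to worst.
GRADE_CUTOFFS = ((0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D"))


def grade(recovery_rate: float) -> str:
    """Map a recovery rate in [0, 1] to a letter grade."""
    if not 0.0 <= recovery_rate <= 1.0:
        raise ValueError("recovery_rate must be between 0 and 1")
    for cutoff, letter in GRADE_CUTOFFS:
        if recovery_rate >= cutoff:
            return letter
    return "F"


def scorecard(results: dict) -> dict:
    """Summarize results shaped like {fault_name: [recovered, ...]}."""
    rates = {
        fault: (sum(runs) / len(runs) if runs else 0.0)
        for fault, runs in results.items()
    }
    grades = {fault: grade(rate) for fault, rate in rates.items()}
    # The weakest fault type is the one with the lowest recovery rate.
    weakest = min(rates, key=rates.get) if rates else None
    return {"rates": rates, "grades": grades, "weakest": weakest}


if __name__ == "__main__":
    card = scorecard({
        "latency": [True, True, True, True],
        "errors": [True, False, False, False],
    })
    print(card["grades"], "weakest:", card["weakest"])
```

Bucketing a continuous rate into a few letters loses precision, but it gives a quick way to rank fault types and decide which one to harden first.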
Mastery check¶
You can move on when you can:
- create a custom `ChaosAction` and run it in an experiment,
- explain the relationship between fault injection, impact measurement, and rollback,
- interpret a resilience scorecard and identify the weakest fault type,
- describe how Netflix's Chaos Monkey works at a high level.
Related Concepts¶