Level 9 / Project 11 - Recovery Time Estimator¶
Home: README
Learn Your Way¶
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | — |
Focus¶
- Monte Carlo simulation for probabilistic time estimation
- Recovery phases: detection, diagnosis, fix, verification, deployment
- Triangular distribution modeling with 3-point estimates (min, expected, max)
- Confidence intervals at p50, p90, and p99
- Impact of team size, runbooks, and incident novelty on recovery time
Why this project exists¶
When systems fail, stakeholders need realistic recovery time estimates — not guesses. "How long until the site is back?" requires modeling the full recovery process: detecting the issue, diagnosing root cause, implementing a fix, verifying it works, and deploying. Each phase has uncertainty. This project uses Monte Carlo simulation to model recovery times as probability distributions, producing confidence intervals that account for team capacity, runbook availability, and incident complexity — the same approach used by incident management teams at Google, Netflix, and PagerDuty.
Run (copy/paste)¶
Expected terminal output¶
{
"incident": "database-outage",
"estimates": {
"p50_minutes": 95,
"p90_minutes": 142,
"p99_minutes": 198
},
"phases": [...],
"scenario_comparison": {...}
}
7 passed
Expected artifacts¶
- Console JSON output with recovery time estimates and scenario comparison
- Passing tests
- Updated
notes.md
Alter it (required)¶
- Add a
communication_overheadphase that models coordination time for multi-team incidents. - Change the estimator to support custom estimation strategies via dependency injection.
- Add a
--runsflag that controls the number of Monte Carlo simulation iterations.
Break it (required)¶
- Set
team_size=0— does the diagnosis multiplier handle division by zero? - Create a
PhaseEstimatewheremin > max— what happens to the triangular distribution? - Run the simulation with
simulation_runs=0— does the percentile calculation handle empty data?
Fix it (required)¶
- Validate that
team_size >= 1inIncidentProfile. - Validate that
min <= expected <= maxinPhaseEstimate.__post_init__. - Return a default estimate when
simulation_runsis 0.
Explain it (teach-back)¶
- What is Monte Carlo simulation and why is it used for time estimation?
- How does the triangular distribution model 3-point estimates (min, expected, max)?
- What do p50, p90, and p99 mean in the context of recovery time estimation?
- Why do runbooks significantly reduce recovery time — what does the multiplier model?
Mastery check¶
You can move on when you can: - explain Monte Carlo simulation and triangular distributions in plain language, - run a scenario comparison and interpret the improvement percentage, - describe how team size, runbooks, and incident novelty affect recovery time, - add a new recovery phase and see its effect on the overall estimate.
Related Concepts¶
| ← Prev | Home | Next → |
|---|---|---|