Level 9 / Project 04 - Observability SLO Pack
Learn Your Way
| Read | Build | Watch | Test | Review | Visualize | Try |
|---|---|---|---|---|---|---|
| — | This project | — | — | Flashcards | — | — |
Focus
- SRE primitives: SLO, SLI, and error budget management
- Time-window compliance calculation with rolling windows
- Burn-rate alerting for early degradation detection
- Strategy pattern for different SLI types (availability, latency)
- Structured SLO dashboard reporting
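One way to picture the Strategy pattern item above is the minimal sketch below. It is not the project's actual code: `SLIStrategy`, `AvailabilitySLI`, `LatencySLI`, and `is_compliant` are hypothetical names, and the latency variant is assumed to measure the share of requests under a threshold rather than a percentile.

```python
from dataclasses import dataclass, field
from typing import Protocol


class SLIStrategy(Protocol):
    """Anything that can report an SLI value as a percentage (0-100)."""

    @property
    def value(self) -> float: ...


@dataclass
class AvailabilitySLI:
    """Share of requests that succeeded."""

    good_count: int = 0
    total_count: int = 0

    def record(self, success: bool) -> None:
        self.total_count += 1
        if success:
            self.good_count += 1

    @property
    def value(self) -> float:
        if self.total_count == 0:
            return 100.0  # assumption: no traffic counts as compliant
        return 100.0 * self.good_count / self.total_count


@dataclass
class LatencySLI:
    """Share of requests that finished within threshold_ms."""

    threshold_ms: float
    samples_ms: list[float] = field(default_factory=list)

    def record(self, latency_ms: float) -> None:
        self.samples_ms.append(latency_ms)

    @property
    def value(self) -> float:
        if not self.samples_ms:
            return 100.0  # same assumption as above
        fast = sum(1 for ms in self.samples_ms if ms <= self.threshold_ms)
        return 100.0 * fast / len(self.samples_ms)


def is_compliant(sli: SLIStrategy, target_pct: float) -> bool:
    """Compare any SLI against a target without knowing its concrete type."""
    return sli.value >= target_pct
```

Because `is_compliant` depends only on the `value` property, adding a new SLI type should not require touching the compliance logic.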
Why this project exists
SLOs (Service Level Objectives) are a foundation of Site Reliability Engineering. A team with a 99.9% availability SLO has an error budget of 0.1%, roughly 43 minutes of downtime per 30-day month. When the budget is exhausted, feature work stops and reliability becomes the priority. This project builds an SLO management system that tracks SLIs, computes compliance, calculates burn rates, and manages error budgets, following the approach Google SRE teams use to balance reliability with feature velocity.
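The error budget arithmetic can be checked with a few lines of Python. This is an illustrative sketch only; the function names are hypothetical and a 30-day window is assumed.

```python
def error_budget_pct(target_pct: float) -> float:
    """Error budget as a percentage: whatever the SLO target leaves over."""
    if not 0.0 < target_pct < 100.0:
        raise ValueError("target_pct must be strictly between 0 and 100")
    return 100.0 - target_pct


def allowed_downtime_minutes(target_pct: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget allows over the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * error_budget_pct(target_pct) / 100.0


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% target: {minutes:.1f} minutes per 30 days")
```

For a 99.9% target this gives about 43.2 minutes, and each additional nine cuts the budget by a factor of ten.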
Run (copy/paste)
Expected terminal output
Expected artifacts
- Console JSON output with SLO compliance and burn rate data
- Passing tests
- Updated `notes.md`
Alter it (required)
- Add a `latency` SLI type that tracks p99 latency instead of success ratios.
- Add multi-window burn-rate alerting (e.g. 1-hour and 6-hour windows with different thresholds).
- Add a `--dashboard` flag that outputs a formatted text dashboard of all SLO statuses.
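As a starting point for the multi-window item, here is a hedged sketch. `Event`, `burn_rate`, `firing_windows`, and `WINDOWS` are hypothetical names, not the project's API. The thresholds (14.4 for 1 hour, 6 for 6 hours) are the values commonly cited from the Google SRE Workbook for a 30-day SLO; treat them as an assumption to tune.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: float  # seconds, on any consistent clock
    success: bool


def burn_rate(events: list[Event], now: float, window_s: float,
              target_pct: float) -> float:
    """Observed error rate in the window divided by the allowed error rate."""
    if not 0.0 < target_pct < 100.0:
        raise ValueError("target_pct must be strictly between 0 and 100")
    recent = [e for e in events if now - window_s <= e.timestamp <= now]
    if not recent:
        return 0.0  # assumption: no traffic burns no budget
    error_rate = sum(1 for e in recent if not e.success) / len(recent)
    allowed_rate = (100.0 - target_pct) / 100.0
    return error_rate / allowed_rate


# window name -> (window length in seconds, burn-rate threshold)
WINDOWS: dict[str, tuple[int, float]] = {
    "1h": (3600, 14.4),
    "6h": (6 * 3600, 6.0),
}


def firing_windows(events: list[Event], now: float, target_pct: float,
                   windows: dict[str, tuple[int, float]] = WINDOWS) -> dict[str, bool]:
    """Report, per window, whether its burn rate exceeds its threshold."""
    return {
        name: burn_rate(events, now, window_s, target_pct) > threshold
        for name, (window_s, threshold) in windows.items()
    }
```

A short, sharp error spike trips the 1-hour window quickly, while the 6-hour window with its lower threshold is meant to catch slower, sustained degradation.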
Break it (required)
- Set `target_pct=100.0`. What happens to the error budget (it becomes 0%)?
- Record zero events and check compliance. Does the SLI value calculation divide by zero?
- Set a burn rate threshold of 0. Does every SLO trigger an alert?
Fix it (required)
- Validate that `target_pct < 100.0` (a 100% target has zero error budget).
- Add a guard in `SLI.value` that returns 100.0 when `total_count == 0`.
- Validate that burn rate thresholds are positive in `check_burn_rates`.
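The three guards might look roughly like the sketch below. The names `SLI.value`, `total_count`, `target_pct`, and `check_burn_rates` come from this page, but the class shapes, the `good_count` field, and the function signature are assumptions made for illustration; adapt them to the project's real definitions.

```python
from dataclasses import dataclass


@dataclass
class SLI:
    good_count: int = 0   # assumed field name
    total_count: int = 0

    @property
    def value(self) -> float:
        # Guard: with no recorded events, report full compliance
        # instead of dividing by zero.
        if self.total_count == 0:
            return 100.0
        return 100.0 * self.good_count / self.total_count


@dataclass
class SLO:
    name: str
    target_pct: float

    def __post_init__(self) -> None:
        # Guard: a 100% target leaves no error budget to manage.
        if not 0.0 < self.target_pct < 100.0:
            raise ValueError(
                f"target_pct must be > 0 and < 100, got {self.target_pct}"
            )


def check_burn_rates(burn_rates: dict[str, float],
                     thresholds: dict[str, float]) -> list[str]:
    """Return the SLO names whose burn rate exceeds their threshold."""
    for name, threshold in thresholds.items():
        # Guard: a zero or negative threshold would alert on everything.
        if threshold <= 0:
            raise ValueError(f"threshold for {name!r} must be positive")
    return [
        name for name, rate in burn_rates.items()
        if name in thresholds and rate > thresholds[name]
    ]
```

Raising `ValueError` at construction time is one design choice; it surfaces a bad configuration before any events are recorded.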
Explain it (teach-back)
- What are an SLI, an SLO, and an error budget, and how do they relate?
- How does burn rate indicate whether you will exhaust your error budget early?
- Why is a 100% availability target impractical — what is the "nines" system?
- How do Google SRE teams use error budgets to balance reliability vs feature velocity?
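For the burn rate question, a small calculation may help. A burn rate of 1 spends the budget exactly over the SLO window, so a higher rate exhausts it proportionally sooner. The sketch below assumes a 30-day window and a constant burn rate; `hours_to_exhaustion` is a hypothetical helper, not part of the project.

```python
def hours_to_exhaustion(burn_rate: float, window_days: int = 30,
                        budget_remaining: float = 1.0) -> float:
    """Hours until the error budget runs out at a constant burn rate.

    budget_remaining is the fraction of the budget still unspent (0.0-1.0).
    """
    if burn_rate <= 0:
        return float("inf")  # nothing is being burned
    window_hours = window_days * 24
    return window_hours * budget_remaining / burn_rate


if __name__ == "__main__":
    for rate in (1.0, 6.0, 14.4):
        print(f"burn rate {rate}: {hours_to_exhaustion(rate):.1f} hours")
```

At a burn rate of 14.4 a full 30-day budget lasts about 50 hours, which is why a high short-window burn rate is treated as urgent.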
Mastery check
You can move on when you can:
- calculate error budgets from SLO targets (e.g. 99.9% = 0.1% budget),
- explain burn rate alerting and why it catches slow degradations,
- add a new SLI type and wire it through the full pack,
- describe how real teams use error budgets to decide when to freeze deployments.
Mastery Check
- Can you explain the architectural trade-offs in your solution?
- Could you refactor this for a completely different use case?
- Can you identify at least two alternative approaches and explain why you chose yours?
- Could you debug this without print statements, using only breakpoint()?