Alert on SLOs

"How to Measure System Reliability" describes the building blocks of the reliability stack, which enable you to measure, assess, and have informed discussions about reliability:

  • Service Level Indicators (SLI) represent a quantifiable measure of service reliability
  • Service Level Objectives (SLO) define how often an SLI has to be met, within a time interval, for users to be happy with your service
  • Error Budgets represent the amount of unreliability that an SLO still allows
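As a minimal sketch of how these three pieces relate, assume the SLI is counted as "good" requests over total requests. The function names, parameters, and sample numbers below are illustrative assumptions, not taken from any particular tool.

```python
def sli(good_events: int, total_events: int) -> float:
    """Fraction of events that met the reliability criterion (the SLI)."""
    if total_events == 0:
        return 1.0  # assumption: no traffic is treated as fully reliable
    return good_events / total_events


def error_budget_remaining(good_events: int, total_events: int,
                           slo_target: float) -> float:
    """Fraction of the error budget still unspent.

    1.0 means the budget is untouched; 0.0 or less means it is exhausted.
    """
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    if allowed_bad <= 0:
        # No budget to spend (no traffic, or a 100% target).
        return 1.0 if actual_bad == 0 else 0.0
    return 1.0 - actual_bad / allowed_bad


# Hypothetical numbers: 1,000,000 requests, 980,000 of them "good", SLO of 95%.
print(sli(980_000, 1_000_000))                           # 0.98
print(error_budget_remaining(980_000, 1_000_000, 0.95))  # approximately 0.6
```

In this sketch, 20,000 bad requests out of an allowed 50,000 leave about 60% of the budget available.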

SLO thresholds

Traditional alerting practices page engineers with alerts that do not directly track user satisfaction. Being paged at 3 a.m. because a server’s CPU usage has been above 80% for the past 5 minutes is standard procedure. And while this is important information, it’s not worth waking someone up if no users are being impacted.

Alerting on a fixed threshold has two further problems:

  • It’s a reactive approach: you receive an alert only once the accepted level of unreliability has already been exceeded
  • Static thresholds are unable to keep up with your service’s evolution. Thresholds are defined around the conditions of today and can quickly become obsolete. Service changes, hardware upgrades, or library patches can impact the meaningfulness of a threshold

Error Budget Burned

An error budget is the amount of unreliability you can tolerate before reaching critical reliability levels, and it should not be treated as just another metric to track. You can use it to create alerts that warn you when something is not working properly well before your SLO target is compromised. Let’s use the previous example, where you have an SLO stating that, for a rolling 30-day window, 95% of requests need to be served in under 200ms. This means that, of all served requests, 5% can take longer than 200ms. If all requests have been served in under 200ms, you have all your error budget available (100%). You can then define alerts on the amount of error budget burned. For example:

  • 25% of error budget burned would trigger an email
  • 50% of error budget burned would trigger a message to Slack
  • 75% of error budget burned would trigger a page
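The tiers above can be sketched as a small lookup. This is only an illustration: the function names are hypothetical, the channel labels are plain strings rather than real integrations, and the budget is assumed to be computed against the requests observed so far in the window.

```python
from typing import Optional

# Thresholds from the example above, checked from most to least severe.
ALERT_TIERS = [
    (0.75, "page"),
    (0.50, "slack"),
    (0.25, "email"),
]


def budget_burned(bad_events: int, total_events: int,
                  slo_target: float = 0.95) -> float:
    """Fraction of the error budget consumed so far (1.0 = fully burned)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events == 0:
        return 0.0  # assumption: no traffic burns no budget
    allowed_bad = (1.0 - slo_target) * total_events
    return bad_events / allowed_bad


def alert_channel(burned: float) -> Optional[str]:
    """Return the most severe channel whose threshold has been reached."""
    for threshold, channel in ALERT_TIERS:
        if burned >= threshold:
            return channel
    return None


# Hypothetical window: 100,000 requests, 3,000 slower than 200ms.
burned = budget_burned(bad_events=3_000, total_events=100_000)
print(alert_channel(burned))  # slack
```

Here 3,000 slow requests out of an allowed 5,000 burn roughly 60% of the budget, which crosses the 50% tier but not the 75% one.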

Burn rate

Burn rate indicates the speed at which your error budget is being consumed, relative to the SLO. Using the previous example, if you have an average error rate of 5% over the 30-day period, all your error budget will be consumed exactly at the end of the window, which corresponds to a burn rate of 1. With a burn rate of 2, all available error budget would be consumed in half the time window (15 days).
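The arithmetic behind these numbers can be sketched as follows, assuming burn rate is defined as the observed error rate divided by the error rate the SLO allows. The function names are illustrative.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_target
    if allowed_error_rate <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_rate / allowed_error_rate


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days until a full budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days / rate


# With a 95% SLO, a 5% error rate burns the budget at rate 1,
# and a 10% error rate burns it at rate 2.
print(round(burn_rate(0.05, 0.95), 2))  # 1.0
print(round(burn_rate(0.10, 0.95), 2))  # 2.0
print(days_to_exhaustion(2.0))          # 15.0
```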

Error Budget Policies

When faced with error budget violations, you need to know what to do. Error budget policies define the thresholds and the actions to take so that error budget consumption is addressed. Each SLO should have a policy in place, and that policy should be revisited regularly. For example:

  • If a single event consumes 25% or more of the error budget, a postmortem must be conducted, and it must include a P0 action that addresses the reliability issue that caused it
  • If 75% of the error budget has been consumed, 50% of engineers must focus on reliability work
  • If the SLO has been breached, feature work must stop and all engineers must focus on restoring it. Only P0 issues and security fixes can be released until the service is back within SLO
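A policy like this can also be written down as code so that it is applied consistently. The sketch below encodes the three example rules; the function name, its inputs, and the action strings are hypothetical, and it assumes a breached SLO takes precedence over the 75% rule.

```python
from typing import List


def policy_actions(budget_consumed: float,
                   largest_single_event: float,
                   slo_breached: bool) -> List[str]:
    """Return the actions the example policy requires.

    budget_consumed and largest_single_event are fractions of the
    total error budget (0.25 means 25%).
    """
    actions = []
    if largest_single_event >= 0.25:
        actions.append("conduct a postmortem with a P0 action")
    if slo_breached:
        # Assumption: the freeze replaces the 50% staffing rule.
        actions.append("stop feature work; release only P0 and security fixes")
    elif budget_consumed >= 0.75:
        actions.append("assign 50% of engineers to reliability work")
    return actions


# Hypothetical state: 80% of the budget gone, 30% of it in one incident.
for action in policy_actions(0.80, 0.30, slo_breached=False):
    print(action)
```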

Conclusion

SLIs, SLOs, and error budgets provide the foundations needed to measure, assess, and prioritize reliability from the users’ point of view. They create a framework and a shared language that allow different teams to understand and talk about reliability.

Cprime
