SLA, SLO, and SLI Metrics

Understanding SLA, SLO, and SLI

Definition: An SLA (Service Level Agreement) is a contract between a service provider and a customer that specifies the expected level of service.
Purpose: Defines obligations and consequences if the agreed-upon reliability is not met.
Example: "The service will be available 99.9% of the time per month. If this is not met, a refund of 10% will be issued."

Definition: An SLO (Service Level Objective) is a specific, measurable target for the level of service reliability.
Purpose: Serves as a benchmark to ensure the SLA is met.
Example: "99.95% of HTTP requests will return a response within 200ms."
Relation to SLA: The SLO is typically stricter than the SLA, which leaves room to detect and fix failures before the SLA is breached.
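
To make the room between an SLO and an SLA concrete, the following minimal sketch converts an availability target into the downtime it permits. The function name `allowed_downtime_minutes` and the 30-day period are assumptions for illustration; 99.9% and 99.95% are the example availability targets used elsewhere in this document.

```python
def allowed_downtime_minutes(target_percent: float, period_days: int = 30) -> float:
    """Return the downtime (in minutes) that an availability target permits."""
    total_minutes = period_days * 24 * 60  # 30 days -> 43,200 minutes
    return round(total_minutes * (100.0 - target_percent) / 100.0, 2)


sla_budget = allowed_downtime_minutes(99.9)   # downtime tolerated by the SLA
slo_budget = allowed_downtime_minutes(99.95)  # downtime tolerated by the stricter SLO
headroom = round(sla_budget - slo_budget, 2)  # buffer left before the SLA is breached
print(sla_budget, slo_budget, headroom)
```

Under these assumptions the SLA tolerates about 43.2 minutes of downtime per 30 days and the SLO about 21.6, so a team that alerts on the SLO has roughly 21.6 minutes of buffer before the contractual limit is reached.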

Definition: An SLI (Service Level Indicator) is a metric that quantifies system performance so that compliance with SLOs can be tracked.
Purpose: Acts as the raw data used to measure whether an SLO is met.
Example: "The percentage of successful HTTP requests over the past 30 days."
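
As a rough illustration of how an SLI feeds an SLO check, the sketch below computes a success percentage from event counts and compares it with a target. The helper names `sli_percent` and `meets_slo`, the request counts, and the choice to treat an empty window as fully compliant are all assumptions made for this example.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the share of good events as a percentage."""
    if total_events == 0:
        # No traffic in the window: treated here as fully compliant (an assumption).
        return 100.0
    return 100.0 * good_events / total_events


def meets_slo(sli: float, slo_target_percent: float) -> bool:
    """An SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target_percent


# Hypothetical counts for a 30-day window.
sli = sli_percent(good_events=999_500, total_events=1_000_000)
print(f"SLI = {sli:.2f}%, SLO met: {meets_slo(sli, 99.9)}")
```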

Availability is often expressed as a percentage of uptime over a given period:

\text{Availability (\%)} = \left( \frac{\text{Uptime}}{\text{Total Time}} \right) \times 100

Example: a 30-day month has 43,200 minutes. With 30 minutes of downtime:

\text{Availability} = \left( \frac{43{,}200 - 30}{43{,}200} \right) \times 100 \approx 99.93\%
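
The availability formula can be checked with a few lines of Python. This is a minimal sketch; the function name `availability_percent` is hypothetical, and the inputs assume the example describes 30 minutes of downtime in a 30-day month (43,200 minutes).

```python
def availability_percent(total_minutes: float, downtime_minutes: float) -> float:
    """Availability (%) = uptime / total time * 100."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    uptime_minutes = total_minutes - downtime_minutes
    return uptime_minutes / total_minutes * 100.0


minutes_in_30_days = 30 * 24 * 60  # 43,200
result = availability_percent(minutes_in_30_days, downtime_minutes=30)
print(f"{result:.2f}%")  # prints 99.93%
```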

Reliability measures the likelihood of a system performing without failure over a specific time:

\text{Reliability (\%)} = e^{-\left(\frac{\text{Total Failures}}{\text{Total Time}}\right)}

Example:

\text{Reliability} = e^{-\left(\frac{2}{100}\right)} = e^{-0.02} \approx 98.02\%
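
The same calculation can be sketched in Python. The function name `reliability` is an assumption. The code follows the formula as given in this document, which matches an exponential model with a constant failure rate evaluated over one unit of time.

```python
import math


def reliability(total_failures: float, total_time: float) -> float:
    """Return e^-(failures / time) as a fraction between 0 and 1."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    failure_rate = total_failures / total_time
    return math.exp(-failure_rate)


r = reliability(total_failures=2, total_time=100)
print(f"{r * 100:.2f}%")  # prints 98.02%
```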

1. SLI Examples:

2. SLO Examples:

Uptime SLO: "The service uptime will be at least 99.95% per month."
Latency SLO: "95% of requests will complete within 200ms, measured over the past week."
Error Rate SLO: "The error rate will not exceed 0.1% of total requests, measured over the past month."
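
A latency SLO of this shape can be evaluated directly from recorded response times. The following is a minimal sketch, not a production implementation: the function names and the sample latencies are invented for illustration, and an empty sample set is assumed to count as compliant.

```python
from typing import Sequence


def fraction_within(latencies_ms: Sequence[float], threshold_ms: float) -> float:
    """Return the fraction of requests that finished within the threshold."""
    if not latencies_ms:
        return 1.0  # no requests recorded: treated as compliant (an assumption)
    fast = sum(1 for value in latencies_ms if value <= threshold_ms)
    return fast / len(latencies_ms)


def latency_slo_met(
    latencies_ms: Sequence[float],
    threshold_ms: float = 200.0,
    target_fraction: float = 0.95,
) -> bool:
    """Check an SLO of the form 'target_fraction of requests within threshold_ms'."""
    return fraction_within(latencies_ms, threshold_ms) >= target_fraction


# Hypothetical week of samples: 96 fast requests and 4 slow ones.
samples = [120.0] * 96 + [350.0] * 4
print(fraction_within(samples, 200.0), latency_slo_met(samples))
```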

3. SLA Example:

SLA: "The service will maintain 99.9% uptime per month. For every 0.1% below this threshold, a 5% refund of the monthly fee will be issued."
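
The tiered refund in this SLA can be expressed as a small function. This sketch makes two assumptions the SLA text does not state: a partial 0.1% step is rounded up to a full step, and the refund is capped at 100% of the monthly fee. The name `refund_percent` is hypothetical. `Decimal` is used so that steps of 0.1 are not distorted by binary floating-point rounding.

```python
from decimal import Decimal, ROUND_CEILING


def refund_percent(
    measured_uptime: str,
    threshold: str = "99.9",
    step: str = "0.1",
    refund_per_step: str = "5",
    cap: str = "100",
) -> Decimal:
    """Return the refund (% of the monthly fee) for a measured uptime percentage."""
    shortfall = Decimal(threshold) - Decimal(str(measured_uptime))
    if shortfall <= 0:
        return Decimal("0")  # SLA met, no refund
    # Assumption: any started 0.1% step counts as a full step.
    steps = (shortfall / Decimal(step)).to_integral_value(rounding=ROUND_CEILING)
    # Assumption: the refund never exceeds the full monthly fee.
    return min(steps * Decimal(refund_per_step), Decimal(cap))


for uptime in ("99.95", "99.8", "99.75", "99.5"):
    print(uptime, refund_percent(uptime))
```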

Monitoring SLIs:

Latency: Use tools like Prometheus to track response time.
Uptime: Use uptime monitoring tools like Pingdom or a custom Prometheus exporter.
Error Rate: Count HTTP 4xx and 5xx responses and divide by the total number of requests.
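
In practice these values would come from a monitoring system such as Prometheus. To show the underlying calculation only, here is a standard-library sketch of a rolling-window error rate. The class `ErrorRateWindow` is hypothetical and not part of any monitoring tool. Following the text above, both 4xx and 5xx responses are counted as errors.

```python
import time
from collections import deque
from typing import Optional


class ErrorRateWindow:
    """Tracks the share of 4xx/5xx responses over a rolling time window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, is_error) pairs, oldest first

    def record(self, status_code: int, now: Optional[float] = None) -> None:
        """Store one response; `now` can be injected to make tests deterministic."""
        timestamp = time.time() if now is None else now
        self._events.append((timestamp, status_code >= 400))

    def error_rate_percent(self, now: Optional[float] = None) -> float:
        """Return the error rate (%) for responses still inside the window."""
        timestamp = time.time() if now is None else now
        cutoff = timestamp - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()  # drop responses older than the window
        if not self._events:
            return 0.0
        errors = sum(1 for _, is_error in self._events if is_error)
        return 100.0 * errors / len(self._events)


window = ErrorRateWindow(window_seconds=60)
for code in (200, 200, 404, 500):
    window.record(code, now=0.0)
rate_early = window.error_rate_percent(now=30.0)  # all four responses in window
window.record(200, now=90.0)
rate_late = window.error_rate_percent(now=90.0)   # only the latest response remains
print(rate_early, rate_late)
```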

Visualizing in Grafana: