SLA, SLO, and SLI Metrics
Understanding SLA, SLO, and SLI
Service Level Agreement (SLA)
- Definition: A contract between a service provider and a customer specifying the expected level of service.
- Purpose: Defines obligations and consequences if the agreed-upon reliability is not met.
- Example: "The service will be available 99.9% of the time per month. If this is not met, a refund of 10% will be issued."
Service Level Objective (SLO)
- Definition: A specific, measurable target for the level of service reliability.
- Purpose: Serves as a benchmark to ensure the SLA is met.
- Example: "99.95% of HTTP requests will return a response within 200ms."
- Relation to SLA: Typically stricter than the SLA, giving room to address failures before breaching the SLA.
Service Level Indicator (SLI)
- Definition: A metric that quantifies system performance to track compliance with SLOs.
- Purpose: Acts as the raw data used to measure whether an SLO is met.
- Example: "The percentage of successful HTTP requests over the past 30 days."
Example Relationships Between SLA, SLO, and SLI
- SLI: Measured uptime = 99.93%
- SLO: Target uptime = 99.95%
- SLA: Guaranteed uptime = 99.90%
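With these numbers the measured SLI (99.93%) falls short of the SLO (99.95%) but still satisfies the SLA (99.90%), which is the buffer a stricter SLO is meant to provide. A minimal Python sketch of that comparison follows; the function name and the returned labels are hypothetical, and it assumes a metric where higher is better, such as uptime.

```python
def service_level_status(sli, slo, sla):
    """Compare a measured SLI with the internal SLO and the contractual SLA.

    All values are percentages (for example 99.93).
    Assumption: higher is better, as with uptime.
    """
    if sli >= slo:
        return "SLO met"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Values from the example above.
print(service_level_status(sli=99.93, slo=99.95, sla=99.90))
# prints: SLO missed, SLA still met
```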
Calculating Availability and Reliability
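Availability is often expressed as the percentage of time a service is up over a given period:
Availability (%) = (Total Time - Downtime) / Total Time × 100
Example:
- Total Time: 30 days (43,200 minutes)
- Downtime: 30 minutes
- Availability: (43,200 - 30) / 43,200 × 100 ≈ 99.93%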
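The availability arithmetic for this example can be sketched in Python. The function names are hypothetical. The second helper, which turns a target percentage into the downtime it allows, is an addition for illustration and uses the 99.9% and 99.95% targets from this document.

```python
def availability_percent(total_minutes, downtime_minutes):
    """Availability as the percentage of the period the service was up."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    return (total_minutes - downtime_minutes) / total_minutes * 100


def allowed_downtime_minutes(total_minutes, target_percent):
    """Downtime that still meets the given availability target."""
    return total_minutes * (1 - target_percent / 100)


total = 30 * 24 * 60  # 30 days = 43,200 minutes

print(round(availability_percent(total, 30), 2))        # 99.93
print(round(allowed_downtime_minutes(total, 99.9), 1))   # 43.2
print(round(allowed_downtime_minutes(total, 99.95), 1))  # 21.6
```

So 30 minutes of downtime in a 30-day month stays within a 99.9% target (43.2 minutes allowed) but exceeds a 99.95% target (21.6 minutes allowed).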
Reliability measures the likelihood of a system performing without failure over a specific time:
Example:
- Failures: 2
- Time: 100 hours
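One common way to put numbers on this example, assumed here rather than taken from the text above, is to derive a failure rate and a mean time between failures (MTBF), and then estimate the probability of a failure-free run with an exponential model. That model assumes a constant failure rate, which may not hold for a real service. All function names are hypothetical.

```python
import math


def failure_rate(failures, hours):
    """Failures per hour over the observation period."""
    return failures / hours


def mtbf_hours(failures, hours):
    """Mean time between failures; infinite if no failure was observed."""
    if failures == 0:
        return math.inf
    return hours / failures


def reliability(failures, hours, mission_hours):
    """Probability of running mission_hours without a failure.

    Assumption: constant failure rate (exponential model), R(t) = exp(-rate * t).
    """
    return math.exp(-failure_rate(failures, hours) * mission_hours)


print(failure_rate(2, 100))                # 0.02 failures per hour
print(mtbf_hours(2, 100))                  # 50.0 hours
print(round(reliability(2, 100, 10), 3))   # 0.819 for a 10-hour run
```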
Examples of Defining Metrics for a Service
Let’s define SLAs, SLOs, and SLIs for a simple web API:
1. SLI Examples:
- Latency: Percentage of requests that complete within 200ms.
- Uptime: Percentage of time the service is reachable.
- Error Rate: Percentage of failed requests.
2. SLO Examples:
- Uptime SLO: "The service uptime will be at least 99.95% per month."
- Latency SLO: "95% of requests each week will complete within 200ms."
- Error Rate SLO: "The error rate will not exceed 0.1% of total requests per month."
3. SLA Example:
- SLA: "The service will maintain 99.9% uptime per month. For every 0.1% below this threshold, a 5% refund of the monthly fee will be issued."
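The refund clause in the SLA example can be sketched as a small calculation. Two details are assumptions, since the clause does not state them: a partial 0.1% step counts as a full step, and the refund is capped at 100% of the monthly fee. The function name is hypothetical, and percentages are passed as strings so that `Decimal` avoids floating-point rounding at step boundaries.

```python
from decimal import Decimal, ROUND_CEILING


def sla_refund_percent(measured_uptime, threshold="99.9", step="0.1", refund_per_step=5):
    """Refund (percent of the monthly fee) owed for a month's measured uptime.

    Assumptions: a partial step counts as a full step, and the
    refund never exceeds 100%.
    """
    shortfall = Decimal(threshold) - Decimal(measured_uptime)
    if shortfall <= 0:
        return 0  # SLA met, no refund
    steps = (shortfall / Decimal(step)).to_integral_value(rounding=ROUND_CEILING)
    return min(int(steps) * refund_per_step, 100)


print(sla_refund_percent("99.93"))  # 0, uptime is above the threshold
print(sla_refund_percent("99.8"))   # 5, one full step below
print(sla_refund_percent("99.75"))  # 10, partial second step rounded up
```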
Monitoring and Implementation
Monitoring SLIs:
- Latency: Use a tool like Prometheus to record response times, for example as a histogram, so the share of requests completing within 200ms can be computed.
- Uptime: Use uptime monitoring tools like Pingdom or a custom Prometheus exporter.
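- Error Rate: Count HTTP 5xx responses as a share of all requests. 4xx responses usually indicate client errors, so track them separately and include them only if they reflect a fault in the service.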
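To make the latency and error-rate SLIs concrete, here is a minimal sketch that computes them from a list of request records in plain Python. It is not how Prometheus or Pingdom work internally. The `Request` type, the function names, and the sample data are hypothetical. By default only 5xx responses count as errors; pass `min_error_status=400` to count 4xx responses as well.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # response time in milliseconds


def latency_sli(requests, threshold_ms=200.0):
    """Percentage of requests that completed within threshold_ms."""
    if not requests:
        raise ValueError("no requests to measure")
    fast = sum(1 for r in requests if r.latency_ms <= threshold_ms)
    return 100 * fast / len(requests)


def error_rate_sli(requests, min_error_status=500):
    """Percentage of requests whose status is at or above min_error_status."""
    if not requests:
        raise ValueError("no requests to measure")
    failed = sum(1 for r in requests if r.status >= min_error_status)
    return 100 * failed / len(requests)


sample = [
    Request(200, 120), Request(200, 180), Request(500, 950),
    Request(404, 40), Request(200, 210),
]
print(latency_sli(sample))                           # 60.0
print(error_rate_sli(sample))                        # 20.0 (5xx only)
print(error_rate_sli(sample, min_error_status=400))  # 40.0 (4xx and 5xx)
```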
Visualizing in Grafana:
- Create panels for each metric to display SLI performance over time.
- Set alerts when an SLO is violated (e.g., error rate exceeds 0.1%).
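In practice the alert condition would be configured in Grafana itself. The threshold logic behind such an alert can still be sketched in plain Python, using the SLO targets from the examples in this document. The metric names, the `SLO_TARGETS` table, and the function are hypothetical and are not part of any Grafana API.

```python
import operator

# SLO targets from the examples above: (comparison that must hold, target value).
SLO_TARGETS = {
    "uptime_percent": (operator.ge, 99.95),
    "latency_within_200ms_percent": (operator.ge, 95.0),
    "error_rate_percent": (operator.le, 0.1),
}


def violated_slos(measured, targets=SLO_TARGETS):
    """Return the names of SLOs whose measured SLI misses its target."""
    return [
        name
        for name, (meets, target) in targets.items()
        if name in measured and not meets(measured[name], target)
    ]


measured = {
    "uptime_percent": 99.93,
    "latency_within_200ms_percent": 96.2,
    "error_rate_percent": 0.25,
}
print(violated_slos(measured))
# prints: ['uptime_percent', 'error_rate_percent']
```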