Site Reliability Engineering

This section contains notes on site reliability engineering.

📄️ What is a Site Reliability Engineer?

Definition

📄️ Chaos Engineering

Chaos engineering is the practice of intentionally introducing controlled disruptions or failures into a system to test its resilience and reliability. The goal is to identify vulnerabilities, understand system behavior under stress, and build confidence in its ability to withstand unexpected conditions.

📄️ Distributed Tracing

Distributed tracing is a technique for tracking requests as they flow through the services of a microservices architecture or other distributed system. It provides visibility into how requests are processed, how services interact, and where bottlenecks or failures occur.

📄️ Kubernetes (k8s)

Kubernetes (often abbreviated as K8s) is an open-source platform for automating the deployment, scaling, and management of containerized applications. It provides a robust framework for running distributed systems reliably and efficiently.

📄️ SLA, SLO, and SLI Metrics

Understanding SLA, SLO, and SLI