Backend maintenance
Preparation
Read the following:
How to utilize workload observability
skip the implementation sections, which are mostly tailored to AWS solutions
From the book The Linux DevOps Handbook
Monitoring, Tracing, Logging
Only (1) open-source solutions, (2) log and metrics retention
Configuration as code
Only (1) CM systems and CaC, (2) Ansible
Aspects
Aim for
reliability (high availability)
performance
security
Includes
updates
patches
security fixes
monitoring
logging
backups
Additional operational aspects that we will not cover
incident response
disaster recovery planning
Workload observability during operation based on the AWS Well-Architected Framework
Based on https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/operate.html
observability through metrics
baselines should be defined
expected outcomes (e.g., customer satisfaction) => determine a number representing success (identify metrics)
The implementations are mostly based on AWS products, but the best practices are worth understanding regardless of the platform.
Only how to utilize workload observability is summarized here. The other two topics of the page, operational health and responding to events, are not covered.
Workload observability best practices:
Analyze workload metrics
latency, requests, errors, capacity, etc.
anti-pattern: over-reliance on technical metrics
these metrics should be connected to business outcomes, e.g., fewer errors do not necessarily mean higher customer satisfaction
Analyze workload logs
anti-pattern: neglecting the analysis until a critical issue arises
use tools that automatically analyze logs
Analyze workload traces
compared to logs, a trace is a detailed record of an application’s operational flow across different components.
anti-pattern: overlooking trace data, relying solely on logs
Create actionable alerts
anti-pattern: too many non-critical alerts, which leads to alert fatigue
Create dashboards
visual representation of system and business health
anti-pattern: relying on dashboards without alerts for anomaly detection
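The points on baselines and actionable alerts can be sketched in a few lines of Python. This is a minimal sketch and not part of the AWS guidance: the helper names `percentile` and `should_alert`, the 500 ms baseline, and the "three consecutive windows" rule are assumptions chosen for illustration. The idea is to compare a per-window p95 latency against a baseline and to alert only on a sustained breach, which is one common way to avoid alert fatigue from single spikes.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def should_alert(windows, baseline_ms, min_consecutive=3):
    """Return True if the p95 latency exceeds the baseline in
    `min_consecutive` windows in a row (hypothetical alert rule)."""
    streak = 0
    for latencies in windows:
        if percentile(latencies, 95) > baseline_ms:
            streak += 1
            if streak >= min_consecutive:
                return True
        else:
            streak = 0  # a healthy window resets the streak
    return False


# Each inner list holds request latencies (ms) for one time window.
quiet = [[100, 120, 110], [900, 100, 105], [100, 115, 108]]
degraded = [[100, 120, 110], [600, 650, 700], [620, 640, 710], [800, 650, 700]]

print(should_alert(quiet, baseline_ms=500))     # single spike only
print(should_alert(degraded, baseline_ms=500))  # sustained breach
```

With these sample windows, the single spike in `quiet` does not trigger an alert, while the three consecutive slow windows in `degraded` do.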
Exercise 19
Your backend system is experiencing high latency, and customers are complaining about slow responses. Which performance metrics could be useful to debug this problem?
Open source observability solutions for self-hosting
Based on Chapter 10 - Monitoring, Tracing, and Distributed Logging from The Linux DevOps Handbook
Prometheus
Grafana
SigNoz
New Relic Pixie
Graylog
Sentry
How much log and metric data do we keep? Keyword: log and metrics retention:
full retention
time-based retention
event-based retention
selective retention
tiered retention
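The five retention types can be contrasted with a small decision function. This is a minimal sketch under stated assumptions: the record fields (`timestamp`, `level`, `incident`), the 30/7/90-day thresholds, and the reading of each policy name are illustrative choices, not definitions taken from the book.

```python
from datetime import datetime, timedelta


def retention_action(policy, record, now):
    """Return 'keep', 'archive', or 'delete' for one log record.

    `record` is a dict with hypothetical keys: 'timestamp' (datetime),
    'level' (str) and 'incident' (bool, tied to an open incident).
    """
    age = now - record["timestamp"]
    if policy == "full":
        return "keep"  # everything is kept indefinitely
    if policy == "time-based":
        return "keep" if age <= timedelta(days=30) else "delete"
    if policy == "event-based":
        # kept only while a triggering event (here: an incident) applies
        return "keep" if record["incident"] else "delete"
    if policy == "selective":
        # only certain kinds of records are worth keeping
        return "keep" if record["level"] in {"ERROR", "CRITICAL"} else "delete"
    if policy == "tiered":
        if age <= timedelta(days=7):
            return "keep"     # hot, fast storage
        if age <= timedelta(days=90):
            return "archive"  # cheaper cold storage
        return "delete"
    raise ValueError(f"unknown policy: {policy}")


now = datetime(2024, 6, 1)
old_debug = {"timestamp": now - timedelta(days=45), "level": "DEBUG", "incident": False}
for policy in ("full", "time-based", "event-based", "selective", "tiered"):
    print(policy, "->", retention_action(policy, old_debug, now))
```

The same 45-day-old debug record is kept, deleted, or archived depending on the policy, which is the trade-off between cost and troubleshooting value that a retention policy has to settle.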
Exercise 20
Your company generates 10GB of logs per day from microservices running in production. Keeping all logs indefinitely would be costly and unnecessary. You need to decide on a retention policy that balances cost, compliance, and troubleshooting needs.
Come up with a scenario for each of the five retention types, e.g., in which context would you choose full retention?
Configuration as code
Based on Chapter 11 - Using Ansible for Configuration as Code from The Linux DevOps Handbook
SaltStack
Chef
Puppet
CFEngine
Ansible <= preferred by the author
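What these tools have in common is that the configuration describes a desired state, and a run changes only what differs from it. The following Python sketch illustrates that idea in isolation. It does not use the API of Ansible or of any other tool in the list; the function names, package names, and version strings are all hypothetical.

```python
def plan_changes(desired, installed):
    """Compare desired package versions with the installed ones and
    return the list of actions needed to converge."""
    actions = []
    for name, version in sorted(desired.items()):
        current = installed.get(name)
        if current is None:
            actions.append(("install", name, version))
        elif current != version:
            actions.append(("upgrade", name, version))
    return actions


def apply_changes(actions, installed):
    """Return a new installed-state dict with the actions applied
    (simulation only, nothing is installed on a real system)."""
    result = dict(installed)
    for _action, name, version in actions:
        result[name] = version
    return result


# Hypothetical desired state kept in version control, and one server's state.
desired = {"nginx": "1.26.1", "openssl": "3.0.14"}
server = {"nginx": "1.24.0"}

first_run = plan_changes(desired, server)
server = apply_changes(first_run, server)
second_run = plan_changes(desired, server)

print(first_run)   # changes needed on the first run
print(second_run)  # empty: the server already matches the desired state
```

The second run produces no actions. This property (idempotence) is what makes it safe to apply the same configuration repeatedly across many servers.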
Exercise 21
A recent security vulnerability has been discovered in your web server software. Your team decides to use Ansible to automate the deployment of security patches across multiple servers.
What could be an alternative to using Ansible for security updates?
Web-based system administration using Webmin
From the chapter Server management and maintenance from the book Practical Linux DevOps.
Further resources
Part IV - Maintaining Systems from the book Building Secure and Reliable Systems by Google
part IV focuses on operational aspects like disaster planning, crisis management, recovery & aftermath
Chapter 20 - Load balancing in the data center from the book Site Reliability Engineering.
algorithms for evenly distributing the work within a given data center.
Chapter 24 - Distributed Periodic Scheduling with Cron from the book Site Reliability Engineering.
the industry is moving toward large distributed systems. If the whole data center is the smallest effective unit of hardware, then cron should be used in a distributed way.
Chapter 25 - Data processing pipelines from the book Site Reliability Engineering.
periodic pipelines triggered by, e.g., cron, are fragile
a continuous pipeline is proposed: Google Workflow