Backend maintenance

Preparation#

Read the following:

How to utilize workload observability
- skip the implementation sections, which are mostly tailored to AWS solutions
From the book The Linux DevOps Handbook
- Monitoring, Tracing, Logging
  - Only (1) open-source solutions, (2) log and metrics retention
- Configuration as code
  - Only (1) CM systems and CaC, (2) Ansible

Aspects#

Aim for

reliability (high availability)
performance
security

Includes

updates
patches
security fixes
monitoring
logging
backups

Additional operational aspects that we will not cover

incident response
disaster recovery planning

Workload observability during operation based on the AWS Well-Architected Framework#

Based on https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/operate.html

observability through metrics
- baselines should be defined
expected outcomes (e.g., customer satisfaction) => determine a number representing success (identify metrics)
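
A baseline can be as simple as a summary of what "normal" looked like in the past. The sketch below is one possible way to do this, assuming a mean and standard deviation are good enough; the checkout success rates, the function names, and the tolerance of three standard deviations are made up for illustration.

```python
from statistics import mean, stdev

def build_baseline(samples):
    """Summarize historical samples as (mean, standard deviation)."""
    return mean(samples), stdev(samples)

def deviates(value, baseline, tolerance=3.0):
    """True when value lies more than `tolerance` standard deviations from the mean."""
    center, spread = baseline
    return abs(value - center) > tolerance * spread

# Hypothetical checkout success rates (%) observed during a normal week.
history = [97.8, 98.1, 98.4, 97.9, 98.0, 98.3, 98.2]
baseline = build_baseline(history)

print(deviates(98.1, baseline))  # False: within the normal range
print(deviates(91.0, baseline))  # True: far below the baseline
```

Here the metric is a business outcome (successful checkouts) rather than a purely technical one, which matches the idea of turning an expected outcome into a number.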

The implementations are mostly based on AWS products; however, the best practices are worth understanding.

This section summarizes only how to utilize workload observability. The other two topics of the Operate chapter, understanding operational health and responding to events, are not covered here.

Workload observability best practices:

Analyze workload metrics
- latency, requests, errors, capacity, etc
- anti-pattern: over-reliance on technical metrics
- connect these metrics to business outcomes, e.g., fewer errors do not necessarily mean higher customer satisfaction
Analyze workload logs
- anti-pattern: neglecting the analysis until a critical issue arises
- use tools that automatically analyze logs
Analyze workload traces
- compared to logs, a trace is a detailed record of an application's operational flow across different components
- anti-pattern: overlooking trace data, relying solely on logs
Create actionable alerts
- anti-pattern: too many non-critical alerts, which leads to alert fatigue
Create dashboards
- visual representation of system and business health
- anti-pattern: relying on dashboards without alerts for anomaly detection
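
To make "actionable alerts" more concrete, here is a minimal sketch of one way to reduce alert fatigue: alert only when the 95th-percentile latency stays above a threshold for several consecutive windows, so a single short spike does not page anyone. The window data, the 500 ms threshold, and the function names are assumptions for illustration, not part of the AWS guidance.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (0 < pct <= 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]

def should_alert(windows, threshold_ms, min_breaches=3):
    """Alert only if p95 latency exceeds the threshold in each of the
    last `min_breaches` windows."""
    if len(windows) < min_breaches:
        return False
    recent = windows[-min_breaches:]
    return all(percentile(w, 95) > threshold_ms for w in recent)

# Hypothetical latency samples in ms, one list per one-minute window.
spike = [[120, 130, 125, 140], [900, 950, 870, 910], [118, 122, 131, 127]]
sustained = [[600, 640, 700, 650], [620, 690, 710, 705], [680, 720, 700, 690]]

print(should_alert(spike, threshold_ms=500))      # False: one bad window only
print(should_alert(sustained, threshold_ms=500))  # True: three bad windows in a row
```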

Exercise 19

Your backend system is experiencing high latency, and customers are complaining about slow responses. Which performance metrics could be useful to debug this problem?

Open source observability solutions for self-hosting#

Based on Chapter 10 - Monitoring, Tracing, and Distributed Logging from The Linux DevOps Handbook

Prometheus
Grafana
SigNoz
New Relic Pixie
Graylog
Sentry

How long do we keep logs and metrics, and which ones? Keyword: log and metrics retention. Common retention policies:

full retention
time-based retention
event-based retention
selective retention
tiered retention
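
The following sketch shows how three of these policies could be expressed as simple decision functions. It reflects one common reading of the terms (time-based: delete after a fixed age; selective: keep only certain kinds of logs; tiered: move older data to cheaper storage before deleting it). The class, the function names, and all day counts are hypothetical.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class LogBatch:
    day: date    # day on which the logs were written
    level: str   # e.g. "DEBUG", "INFO", "WARNING", "ERROR"

def time_based(batch, today, keep_days=30):
    """Keep everything younger than keep_days, delete the rest."""
    return "keep" if (today - batch.day).days < keep_days else "delete"

def selective(batch, keep_levels=frozenset({"WARNING", "ERROR"})):
    """Keep only the log levels considered worth storing."""
    return "keep" if batch.level in keep_levels else "delete"

def tiered(batch, today, hot_days=7, cold_days=90):
    """Recent data stays in fast storage, older data moves to cheap storage."""
    age = (today - batch.day).days
    if age < hot_days:
        return "hot"
    if age < cold_days:
        return "cold"
    return "delete"

today = date(2025, 1, 31)
batch = LogBatch(day=date(2025, 1, 1), level="DEBUG")  # 30 days old

print(time_based(batch, today))  # delete
print(selective(batch))          # delete
print(tiered(batch, today))      # cold
```

Full retention corresponds to never returning "delete", and event-based retention would key the decision on a trigger (for example, an incident) instead of age or level.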

Exercise 20

Your company generates 10 GB of logs per day from microservices running in production. Keeping all logs indefinitely would be costly and unnecessary. You need to decide on a retention policy that balances cost, compliance, and troubleshooting needs.

Come up with a scenario for each of the five retention types, e.g., in which context would you choose full retention?

Configuration as code#

Based on Chapter 11 (configuration as code with Ansible) from The Linux DevOps Handbook

SaltStack
Chef
Puppet
CFEngine
Ansible <= preferred by the author
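
What these tools have in common is that you declare a desired state and the tool applies only the changes needed to reach it, so running it twice is safe (idempotence). The sketch below illustrates that idea in plain Python; it is not how Ansible or any of the other tools is implemented, and the package names and versions are made up.

```python
def converge(current, desired):
    """Return (new_state, changes) such that new_state satisfies desired.

    Only entries that differ are changed, so a second run reports no changes.
    """
    changes = {k: v for k, v in desired.items() if current.get(k) != v}
    new_state = {**current, **changes}
    return new_state, changes

# Hypothetical server state and desired configuration.
server = {"nginx": "1.24.0", "ufw": "enabled"}
desired = {"nginx": "1.26.1", "ufw": "enabled", "fail2ban": "installed"}

server, first_run = converge(server, desired)
server, second_run = converge(server, desired)

print(first_run)   # {'nginx': '1.26.1', 'fail2ban': 'installed'}
print(second_run)  # {}
```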

Exercise 21

A recent security vulnerability has been discovered in your web server software. Your team decides to use Ansible to automate the deployment of security patches across multiple servers.

What could be an alternative to using Ansible for security updates?

Web-based system administration using Webmin#

From the chapter Server management and maintenance from the book Practical Linux DevOps.

manage, maintain, and back up servers using Webmin and Chef
your university may have a license for the chapter

Further resources#

Part IV - Maintaining Systems from the book Building Secure and Reliable Systems by Google
- part IV focuses on operational aspects like disaster planning, crisis management, recovery & aftermath
Chapter 20 - Load balancing in the data center from the book Site Reliability Engineering
- algorithms for evenly distributing the work within a given data center
Chapter 24 - Distributed Periodic Scheduling with Cron from the book Site Reliability Engineering
- the industry is moving toward large distributed systems; if the whole data center is the smallest effective unit of hardware, then cron should run in a distributed way
Chapter 25 - Data processing pipelines from the book Site Reliability Engineering
- periodic pipelines triggered by, e.g., cron, are fragile
- a continuous pipeline is proposed as an alternative: Google Workflow