Backend maintenance#

Preparation#

Read the following:

  • How to utilize workload observability

    • skip the implementation sections, which are mostly tailored to AWS solutions

  • From the book The Linux DevOps Handbook

    • Monitoring, Tracing, Logging

      • Only (1) open-source solutions, (2) log and metrics retention

    • Configuration as code

      • Only (1) CM systems and CaC, (2) Ansible

Aspects#

Aim for

  • reliability (high availability)

  • performance

  • security

Includes

  • updates

  • patches

  • security fixes

  • monitoring

  • logging

  • backups

Additional operational aspects that we will not cover

  • incident response

  • disaster recovery planning

Workload observability during operation based on the AWS Well-Architected Framework#

Based on https://docs.aws.amazon.com/wellarchitected/latest/operational-excellence-pillar/operate.html

  • observability through metrics

    • baselines should be defined

  • expected outcomes (e.g., customer satisfaction) => determine a number representing success (identify metrics)
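A minimal sketch of comparing a metric against a baseline, assuming request latency is the metric and that a deviation of more than 20% from the baseline counts as abnormal. The function names, the sample values, and the tolerance are hypothetical and do not come from the AWS guidance.

```python
from statistics import quantiles


def percentile(samples, p):
    """Return the p-th percentile (1..99) of the samples."""
    # quantiles(..., n=100) returns the 99 cut points between percentiles.
    cuts = quantiles(samples, n=100, method="inclusive")
    return cuts[p - 1]


def exceeds_baseline(samples, baseline_ms, p=95, tolerance=0.2):
    """True if the p-th percentile is more than `tolerance` above the baseline."""
    return percentile(samples, p) > baseline_ms * (1 + tolerance)


# Hypothetical latency samples in milliseconds, with one slow outlier.
latencies_ms = [120, 130, 125, 140, 135, 128, 132, 138, 126, 900]

if exceeds_baseline(latencies_ms, baseline_ms=200):
    print("p95 latency is above the baseline")
```

With these samples the median stays close to normal while the 95th percentile does not, which is why a baseline is usually defined per percentile rather than for the average alone.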

The implementations are mostly based on AWS products; however, the best practices are worth understanding.

This section summarizes only how to utilize workload observability. The other two topics on that page, operational health and responding to events, are not covered here.

Workload observability best practices:

  1. Analyze workload metrics

    • latency, requests, errors, capacity, etc.

    • anti-pattern: over-reliance on technical metrics

    • technical metrics should be connected to business outcomes, e.g., fewer errors do not necessarily mean higher customer satisfaction

  2. Analyze workload logs

    • anti-pattern: neglecting the analysis until a critical issue arises

    • use tools that automatically analyze logs

  3. Analyze workload traces

    • compared to logs, a trace is a detailed record of an application’s operational flow across different components

    • anti-pattern: overlooking trace data, relying solely on logs

  4. Create actionable alerts

    • anti-pattern: too many non-critical alerts, which leads to alert fatigue

  5. Create dashboards

    • visual representation of system and business health

    • anti-pattern: relying on dashboards without alerts for anomaly detection
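One common way to keep alerts actionable and reduce alert fatigue is to fire only when a metric stays above its threshold for several consecutive measurement windows. The sketch below illustrates that idea; the class name `SustainedAlert`, the 5% error-rate threshold, and the window count are assumptions chosen for illustration, not part of the AWS guidance.

```python
from collections import deque


class SustainedAlert:
    """Fire only after `windows` consecutive observations above `threshold`."""

    def __init__(self, threshold, windows):
        self.threshold = threshold
        # Keeps only the most recent `windows` breach flags.
        self.recent = deque(maxlen=windows)

    def observe(self, value):
        """Record one observation and return True if the alert should fire."""
        self.recent.append(value > self.threshold)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


# Hypothetical error rates per minute; a single spike does not fire the alert.
alert = SustainedAlert(threshold=0.05, windows=3)
error_rates = [0.01, 0.09, 0.02, 0.06, 0.07, 0.08, 0.01]
fired = [alert.observe(rate) for rate in error_rates]
```

The trade-off is a delay of `windows` measurements before the alert fires, so the window count should be short for critical signals.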

Exercise 19

Your backend system is experiencing high latency, and customers are complaining about slow responses. Which performance metrics could be useful to debug this problem?

Open source observability solutions for self-hosting#

Based on Chapter 10 - Monitoring, Tracing, and Distributed Logging from The Linux DevOps Handbook

  • Prometheus

  • Grafana

  • SigNoz

  • New Relic Pixie

  • Graylog

  • Sentry

How much log and metric data do we keep? Keyword: log and metrics retention. Retention types:

  • full retention

  • time-based retention

  • event-based retention

  • selective retention

  • tiered retention
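As a rough sketch of how two of these types can be combined, the following assumes that time-based retention means deleting data after a fixed age and that tiered retention means moving older data to cheaper storage before deletion. The tier names and the 7-day and 90-day limits are hypothetical values, not recommendations from the book.

```python
from datetime import datetime, timedelta

HOT_DAYS = 7    # hypothetical: keep recent logs on fast, searchable storage
COLD_DAYS = 90  # hypothetical: keep older logs on cheap archive storage


def storage_tier(log_time, now):
    """Decide where a log entry belongs based on its age."""
    age = now - log_time
    if age <= timedelta(days=HOT_DAYS):
        return "hot"
    if age <= timedelta(days=COLD_DAYS):
        return "cold"
    # Time-based retention: anything older than COLD_DAYS is removed.
    return "delete"


now = datetime(2025, 1, 31)
print(storage_tier(datetime(2025, 1, 29), now))
print(storage_tier(datetime(2024, 12, 1), now))
print(storage_tier(datetime(2024, 6, 1), now))
```

In practice the observability tool or the storage backend usually enforces such a policy through its own configuration, so a script like this mainly serves to reason about the thresholds.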

Exercise 20

Your company generates 10 GB of logs per day from microservices running in production. Keeping all logs indefinitely would be costly and unnecessary. You need to decide on a retention policy that balances cost, compliance, and troubleshooting needs.

Come up with a scenario for each of the five retention types, e.g., in which context would you choose full retention?

Configuration as code#

Based on Chapter 11 - Using Ansible for Configuration as Code from The Linux DevOps Handbook

  • SaltStack

  • Chef

  • Puppet

  • CFEngine

  • Ansible <= preferred by the author
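A shared idea behind these tools is that the desired state is declared once and then applied idempotently: running the same configuration a second time should change nothing. The sketch below models only that idea in plain Python and does not use the API of any of the listed tools; the package names, version numbers, and function names are hypothetical.

```python
def plan_changes(current, desired):
    """Return only the entries whose current value differs from the desired one."""
    return {name: value for name, value in desired.items()
            if current.get(name) != value}


def apply(current, desired):
    """Return the new state and the changes that were needed to reach it."""
    changes = plan_changes(current, desired)
    return {**current, **changes}, changes


# Hypothetical installed versions on one server and the declared target.
current = {"nginx": "1.24.0", "openssl": "3.0.1"}
desired = {"nginx": "1.26.1"}

state, first_run = apply(current, desired)
state, second_run = apply(state, desired)
print(first_run)   # changes needed on the first run
print(second_run)  # empty on the second run, because the state already matches
```

Because the second run reports no changes, the same declaration can be applied safely to many servers regardless of their starting state.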

Exercise 21

A recent security vulnerability has been discovered in your web server software. Your team decides to use Ansible to automate the deployment of security patches across multiple servers.

What could be an alternative to using Ansible for security updates?

Web-based system administration using Webmin#

Based on the chapter Server management and maintenance from the book Practical Linux DevOps.

  • manage, maintain, and back up servers using Webmin and Chef

  • your university may have a license for the chapter

Further resources#