Maintaining a payroll system

https://upload.wikimedia.org/wikipedia/commons/1/14/Datacenter_de_ARSAT.jpg

Fig. 1 A data center
CC BY-SA 4.0. By IMarcoHerrera. Source: Wikimedia Commons

You are working as an engineer in a team that maintains the payroll system of a car components company called Parts Unlimited. One day, your manager forwards you a message they received.

> From: Dick Landry
> To: Steve Masters
> Date: September 2, 8:27 AM
> Priority: Highest
> Subject: ACTION NEEDED: payroll run is failing

> Hey, Steve. We’ve got serious issues with this week’s payroll. We’re trying to figure out if the problem is with the numbers or in the payroll system. Either way, thousands of employees have paychecks stuck in system & are at risk of not getting paid. Seriously bad news. We must fix this before payroll window closes at 5 PM today. Please advise on how to escalate this, given the new IT org. Dick

Your manager adds:

… Employees not getting paychecks means families not being able to pay their mortgages or put food on the table.

… I realize[d] that my family’s mortgage payment is due in four days, and we could be one of the families affected. A late mortgage payment could screw up our credit rating even more, which we spent years repairing after we put Paige’s student loans on my credit card.

Assume that the problem is in the backend and that the payroll processing infrastructure runs in two large data centers.

https://upload.wikimedia.org/wikipedia/commons/0/02/The_Thundering_Herd_-_1925_Lobby_Card.jpg

Fig. 2 The lower part of the picture depicts a thundering herd.
Public domain. By Famous Players-Lasky Corporation, Beinecke Library. Source: Wikimedia Commons

  • Is the payroll system a mission critical system?

  • How would you react to this problem as an engineer?

  • Which technical tools do you need to assess the problem?

  • Which tools could be helpful to foresee a problem like this?

  • Upon investigation, you see that a cron job was triggered on thousands of compute nodes at the same time, which could have led to a thundering herd problem. How would you cope with such a problem during operation?

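In the context of the thundering herd problem, "thundering" is used metaphorically: it refers to the loud and simultaneous rush of many animals, e.g., a herd of buffalo. In computing, the term describes many processes or nodes that wake up at the same moment and compete for the same resource, which can overwhelm that resource.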
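
One common way to soften a thundering herd caused by a scheduled job is to add random jitter, so that the nodes spread their start times over a window instead of starting in the same second. Failed calls can likewise be retried with exponential backoff plus jitter, so that retries do not arrive in synchronized waves. The Python sketch below illustrates both ideas. It is not taken from the scenario: the node count, the 300-second window, and all function names are hypothetical and chosen only for illustration.

```python
import random


def jittered_start_delay(window_seconds: float, rng: random.Random) -> float:
    """Return a random delay in [0, window_seconds] before a node starts its job."""
    return rng.uniform(0.0, window_seconds)


def backoff_with_full_jitter(attempt: int, base: float, cap: float,
                             rng: random.Random) -> float:
    """Return a retry delay drawn uniformly from [0, min(cap, base * 2**attempt)]."""
    upper = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, upper)


def peak_starts_per_second(delays) -> int:
    """Group start delays into one-second buckets and return the largest bucket."""
    buckets = {}
    for delay in delays:
        second = int(delay)
        buckets[second] = buckets.get(second, 0) + 1
    return max(buckets.values())


# Hypothetical numbers: 5000 nodes, start times spread over a 300-second window.
rng = random.Random(42)   # seeded only to make the sketch reproducible
nodes = 5000
window = 300.0

no_jitter = [0.0] * nodes  # every node fires in the same second
with_jitter = [jittered_start_delay(window, rng) for _ in range(nodes)]

print("peak starts per second without jitter:", peak_starts_per_second(no_jitter))
print("peak starts per second with jitter:   ", peak_starts_per_second(with_jitter))

# Retry delays for the first five attempts (base 1 s, capped at 60 s).
retry_delays = [backoff_with_full_jitter(a, base=1.0, cap=60.0, rng=rng)
                for a in range(5)]
print("retry delays:", [round(d, 2) for d in retry_delays])
```

Without jitter, all 5000 simulated nodes land in the same one-second bucket. With jitter, the load is spread across the window, so the busiest second sees only a small fraction of the nodes. On a real system, the delay would be applied with a sleep before the job body runs. The window size trades off load smoothing against how late the last node finishes.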