Downtime and data loss are unavoidable facts of life for systems administrators. What we can control about such service failures is their frequency and their severity.
Often, these failures are due to embarrassingly simple errors, omissions, oversights, or mistakes on the part of the sysadmin. This isn't because sysadmins are stupid; rather, there's usually so much to do that some tasks and checkups fall through the cracks.
This checklist is intended as a preventative tool for people administering server machines. Admins can use it to run a system checkup once every 1-6 months to reduce the chances, and the effects, of serious system failures. It is even more valuable as the basis for a peer audit by another trusted, experienced sysadmin. Another person can provide a fresh perspective, and they won't get distracted fixing one potential problem and forget to do the other checks.
Note that the questions may seem particularly stupid. Wise sysadmins don't assume that they would never make such simple and ridiculous mistakes, usually because they've learned from experience that they can.
Time
- Is the time set correctly?
- Does the default timezone make sense?
- Is there a time synchronization tool (like ntp) installed on the server?
- Does a cron job synchronize the time on the server on a periodic basis?
- Do the ntp servers that the ntp job uses exist, and are they available?
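
One rough way to answer the time questions above from a script is to ask an NTP server for its time and compare it with the local clock. The sketch below is a minimal SNTP-style query, not a replacement for a real NTP client. The function names are hypothetical, and the default server name `pool.ntp.org` is an assumption; substitute whichever servers your own configuration lists.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800


def build_request():
    """Build a 48-byte client request: LI=0, version=3, mode=3, rest zero."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet):
    """Return the server's transmit timestamp as Unix time in seconds."""
    if len(packet) < 48:
        raise ValueError("short NTP packet")
    # The transmit timestamp occupies bytes 40-47: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def clock_offset(server="pool.ntp.org", timeout=5.0):
    """Return a rough offset (server time minus local time) in seconds.

    The server name is an assumption; pass the one your NTP setup uses.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(build_request(), (server, 123))
        packet, _ = sock.recvfrom(512)
        received = time.time()
    # Compare against the midpoint of the round trip to cancel symmetric delay.
    return parse_transmit_time(packet) - (sent + received) / 2
```

Calling `clock_offset("your.ntp.server")` raises an `OSError` (a name lookup failure or a timeout) when the server does not exist or is unreachable, which covers the last question in the list. A returned offset of more than a second or two suggests that synchronization is not actually happening.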
Disk
- Is there sufficient disk space? On all devices and partitions?
- Is there a disk space monitor (either a daemon or a cron job)? Is it running?
- If there's a disk space monitor, do notifications of problems go to the correct email address?
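
If no disk space monitor is in place, even a very small script run from cron is better than nothing. The following is a minimal sketch, assuming a 10% free-space threshold and a hand-picked list of mount points; both are illustrative values and the function names are hypothetical.

```python
import os
import shutil


def free_fraction(path, usage=shutil.disk_usage):
    """Return the free fraction of the filesystem that contains `path`."""
    stats = usage(path)
    if stats.total == 0:
        # Some pseudo-filesystems report zero size; treat them as not low.
        return 1.0
    return stats.free / stats.total


def low_space_paths(paths, min_free=0.10, usage=shutil.disk_usage):
    """Return (path, free_fraction) pairs for paths below the threshold."""
    low = []
    for path in paths:
        fraction = free_fraction(path, usage)
        if fraction < min_free:
            low.append((path, fraction))
    return low


if __name__ == "__main__":
    # Assumed mount points; adjust to the devices and partitions on the server.
    candidates = [p for p in ("/", "/var", "/home") if os.path.exists(p)]
    for path, fraction in low_space_paths(candidates):
        print(f"WARNING: {path} has only {fraction:.1%} free")
```

The script prints only when something is wrong. Cron normally mails a job's output to the job owner or to the address set in `MAILTO`, so it is worth confirming that this address is one somebody actually reads.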
Filesystem backup
- Are volatile top-level directories (/etc/, /var/, /home/) backed up regularly? Other important data?
- Does the backup frequency match the volatility of the data? If data changes a lot daily, do backups happen daily?
- Are old backups saved for a period of time, just in case? For example, the last 7 backups.
- Are the backups stored off the main filesystem (e.g., on tapes, on other removable media, on an external hard drive, on another server)?
- Are the backups stored off-site?
- Did the last backup job run?
- Do the last few backup archives look like they're about the right size? Do they look like they have all the data that they should have?
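
The last two questions lend themselves to a quick automated sanity check. The sketch below flags a newest archive that is too old or whose size differs sharply from the median of the earlier ones. The 26-hour age limit (a daily job plus some slack) and the 50% size tolerance are assumptions to tune, and the function names are hypothetical. A size check cannot confirm that the archive holds the right data; for that, someone still has to look inside it.

```python
import os
import time
from statistics import median


def check_backups(archives, now=None, max_age_hours=26, tolerance=0.5):
    """Check a list of (mtime_seconds, size_bytes) tuples for obvious problems.

    Returns a list of problem descriptions; an empty list means nothing
    looked wrong. The default thresholds are illustrative assumptions.
    """
    if not archives:
        return ["no backup archives found"]
    now = time.time() if now is None else now
    ordered = sorted(archives)  # oldest first, newest last
    latest_mtime, latest_size = ordered[-1]
    problems = []

    age_hours = (now - latest_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"latest backup is {age_hours:.0f} hours old")

    previous_sizes = [size for _, size in ordered[:-1]]
    if previous_sizes:
        typical = median(previous_sizes)
        if typical > 0 and abs(latest_size - typical) > tolerance * typical:
            problems.append(
                f"latest backup is {latest_size} bytes, typical is {typical:.0f}"
            )
    return problems


def scan_directory(directory):
    """Collect (mtime, size) for every regular file in `directory`."""
    entries = []
    with os.scandir(directory) as listing:
        for entry in listing:
            if entry.is_file():
                info = entry.stat()
                entries.append((info.st_mtime, info.st_size))
    return entries
```

A typical use would be `check_backups(scan_directory(backup_dir))`, where `backup_dir` is wherever the archives are kept, with any returned problems printed or mailed.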
Database
