Following a discussion with the slurm-llnl package maintainer, here's a testing setup:
Create 3 VMs:
one with slurmd (compute node, 2 CPUs)
one with slurmctld
and one with slurmdbd.
Each VM's hostname is the name of the service it runs (populate /etc/hostname and /etc/hosts accordingly); slurm.conf and slurmdbd.conf are given below.
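For example, assuming the three VMs sit on a private network (the 192.168.56.x addresses below are placeholders), each /etc/hosts could contain:
192.168.56.10 slurmctld
192.168.56.11 slurmdbd
192.168.56.12 slurmd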
They all share the same /etc/munge/munge.key file. Make sure munged is running everywhere (update-rc.d munge enable).
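A minimal sketch of distributing the key from slurmctld, assuming root SSH access between the VMs:
slurmctld# scp /etc/munge/munge.key slurmdbd:/etc/munge/
slurmctld# scp /etc/munge/munge.key slurmd:/etc/munge/
# on each VM: fix ownership/permissions, enable and start munged
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key
update-rc.d munge enable
service munge start
# quick cross-node check: the output should end with STATUS: Success (0)
slurmctld# munge -n | ssh slurmd unmunge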
/etc/slurm-llnl/slurm.conf:
ControlMachine=slurmctld
AuthType=auth/munge
CryptoType=crypto/munge
MpiDefault=none
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/lib/slurm-llnl/slurmd
SlurmUser=slurm
StateSaveLocation=/var/lib/slurm-llnl/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
FastSchedule=1
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
AccountingStorageEnforce=association
AccountingStorageHost=slurmdbd
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmctld.log
SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
NodeName=slurmd CPUs=2 State=UNKNOWN
PartitionName=debug Nodes=slurmd Default=YES MaxTime=INFINITE State=UP
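After editing slurm.conf (the file must be identical on slurmctld and slurmd), restart the daemons; exact service names may vary with the Debian release:
slurmctld# service slurmctld restart
slurmd# service slurmd restart
# or, for runtime changes to most options:
slurmctld# scontrol reconfigure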
/etc/slurm-llnl/slurmdbd.conf:
AuthType=auth/munge
AuthInfo=/var/run/munge/munge.socket.2
DbdHost=localhost
DebugLevel=3
StorageHost=localhost
StorageLoc=slurm
StoragePass=shazaam
StorageType=accounting_storage/mysql
StorageUser=slurm
LogFile=/var/log/slurm-llnl/slurmdbd.log
PidFile=/var/run/slurm-llnl/slurmdbd.pid
SlurmUser=slurm
ArchiveDir=/var/log/slurm-llnl/
ArchiveEvents=yes
ArchiveJobs=yes
ArchiveResvs=yes
ArchiveSteps=yes
ArchiveSuspend=yes
PurgeEventAfter=1hour
PurgeJobAfter=1hour
PurgeResvAfter=1hour
PurgeStepAfter=1hour
PurgeSuspendAfter=1hour
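Since slurmdbd.conf contains the database password, it's good practice (and newer slurmdbd versions insist on it) to make the file readable only by SlurmUser:
slurmdbd# chown slurm:slurm /etc/slurm-llnl/slurmdbd.conf
slurmdbd# chmod 600 /etc/slurm-llnl/slurmdbd.conf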
On slurmdbd, create a MySQL database called slurm with write permission for user slurm (password shazaam):
CREATE DATABASE slurm;
GRANT ALL PRIVILEGES ON slurm.* TO 'slurm' IDENTIFIED BY 'shazaam';
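Note: MySQL 8 removed the implicit user creation via GRANT ... IDENTIFIED BY, so on recent versions create the user explicitly first:
CREATE USER 'slurm'@'localhost' IDENTIFIED BY 'shazaam';
GRANT ALL PRIVILEGES ON slurm.* TO 'slurm'@'localhost';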
With sacctmgr (package slurm-client) add a cluster, an account and a user:
sacctmgr -i add cluster cluster
sacctmgr -i add account oliva Cluster=cluster
sacctmgr -i add user oliva Account=oliva
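To verify that the cluster and associations were created:
sacctmgr show cluster
sacctmgr show associations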
Then run a couple of jobs as user oliva with srun or sbatch: you can see them in the cluster history with sacct.
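A minimal batch script for sbatch (the file name test.sh is just a placeholder) could be:
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --output=test-%j.out
#SBATCH --ntasks=1
srun /bin/hostname
Submit it as user oliva with: sbatch test.sh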
# nodes status
slurmctld# sinfo
# submit a job
slurmctld# srun -l /bin/hostname
# list jobs
slurmctld# sacct
# reset a node (e.g. stuck in 'alloc' state)
slurmctld# scontrol update NodeName=slurmd State=down Reason=x
slurmctld# scontrol update NodeName=slurmd State=resume
Given the slurmdbd.conf settings above, job records are purged at the beginning of the hour after the job has run and archived in two files named:
cluster_job_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
cluster_step_archive_2019-12-09T01:00:00_2019-12-09T01:59:59
with the current date under /var/log/slurm-llnl/.
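Once the purge has run, the archives can be checked on slurmdbd:
slurmdbd# ls -l /var/log/slurm-llnl/cluster_*archive*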
CVE-2019-12838 note: to reproduce this SQL injection issue, try reloading the archived files with:
sacctmgr archive load file=/var/log/slurm-llnl/...
See also https://slurm.schedmd.com/quickstart.html and https://slurm.schedmd.com/troubleshoot.html
