Slurm: Simple Linux Utility for Resource Management

Overview:

Slurm has a centralized manager, slurmctld, to monitor resources and work.

Each compute server (node) has a slurmd daemon, which can be compared to a remote shell: it waits for work, executes that work, returns status, and waits for more work.

The slurmd daemons provide fault-tolerant hierarchical communications. There is an optional slurmdbd (Slurm DataBase Daemon) which can be used to record accounting information for multiple Slurm-managed clusters in a single database.
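Once a cluster is installed (the steps below), this division of labor is easy to observe: client tools talk to slurmctld, which in turn tracks every registered slurmd. A quick sketch of the usual probes (they require a working installation):

```shell
# Ask the controller whether it is reachable and which
# instance (primary/backup) is responding:
scontrol ping

# List, one line per node, the slurmd nodes the controller knows about:
sinfo --Node
```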

Installation:

Create SlurmUser

NOTE: The SlurmUser must exist prior to starting Slurm and must exist on all nodes of the cluster.

$ sudo useradd slurm
$ sudo passwd slurm

Install Related Software:

  • MUNGE: an authentication plugin that identifies the user originating a message.
$ sudo apt install libmunge-dev libmunge2 munge

# Check munge service was started
$ systemctl status munge.service 
● munge.service - MUNGE authentication service
   Loaded: loaded (/lib/systemd/system/munge.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2018-05-06 07:35:02 CST; 1 day 3h ago
     Docs: man:munged(8)
  Process: 32208 ExecStart=/usr/sbin/munged (code=exited, status=0/SUCCESS)
 Main PID: 32211 (munged)
    Tasks: 4
   Memory: 376.0K
      CPU: 517ms
   CGroup: /system.slice/munge.service
           └─32211 /usr/sbin/munged
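MUNGE can only authenticate across nodes if every node shares the identical /etc/munge/munge.key. A hedged sketch of distributing and testing the key (the host names node1/node2 are placeholders; on older MUNGE releases the key is generated by create-munge-key rather than mungekey):

```shell
# Generate a key on the master if the package did not create one
# (mungekey ships with newer MUNGE; older releases use create-munge-key).
sudo /usr/sbin/mungekey

# Copy the key to each compute node (placeholder host names), then fix
# ownership and permissions, which munged enforces strictly.
for node in node1 node2; do
  sudo scp /etc/munge/munge.key root@"$node":/etc/munge/munge.key
  ssh root@"$node" 'chown munge:munge /etc/munge/munge.key && chmod 400 /etc/munge/munge.key'
done

# Round-trip test: a credential minted here should decode remotely.
munge -n | ssh node1 unmunge
```

Restart the munge service on every node after the key changes.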

Download the Slurm source code: https://www.schedmd.com/downloads.php

$ git clone https://github.com/SchedMD/slurm.git

Install Slurm:

# If you downloaded a stable release tarball
$ tar -xaf slurm*tar.bz2
$ cd slurm-*/

# Install slurm
$ ./configure
$ make
$ sudo make install
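After make install (default prefix /usr/local), a quick sanity check that the binaries and the shared library are visible; a sketch:

```shell
# The daemons and client tools should now be on PATH:
slurmctld -V
srun --version

# If the tools fail to start because libslurm cannot be loaded,
# refresh the dynamic linker cache so /usr/local/lib is searched:
sudo ldconfig
```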

Copy the configuration file and modify it:

# slurm.conf tool: https://slurm.schedmd.com/configurator.html
# You can reference via https://github.com/SchedMD/slurm/blob/master/etc/slurm.conf.example
$ cp etc/slurm.conf.example /usr/local/etc/slurm.conf
$ vim /usr/local/etc/slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=slurm-master
ControlAddr=192.168.10.1
# 
#MailProg=/bin/mail 
MpiDefault=none
#MpiParams=ports=#-# 
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817 
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818 
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root 
StateSaveLocation=/var/spool
SwitchType=switch/none
TaskPlugin=task/none
# 
# 
# TIMERS 
#KillWait=30 
#MinJobAge=300 
#SlurmctldTimeout=120 
#SlurmdTimeout=300 
# 
# 
# SCHEDULING 
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
# 
# 
# LOGGING AND ACCOUNTING 
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30 
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3 
#SlurmctldLogFile=
#SlurmdDebug=3 
#SlurmdLogFile=
# 
# 
# COMPUTE NODES 
NodeName=slurm-master NodeAddr=192.168.10.1 CPUs=1 State=UNKNOWN 
PartitionName=debug Nodes=slurm-master Default=YES MaxTime=INFINITE State=UP
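Before distributing slurm.conf, it helps to let slurmd report the hardware it actually detects; the output is printed in slurm.conf syntax and can be pasted into the NodeName line above:

```shell
# Print this node's detected topology in slurm.conf format, e.g.
# "NodeName=slurm-master CPUs=1 Boards=1 SocketsPerBoard=1 ... RealMemory=..."
slurmd -C
```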

Start appropriate services on each system:

# Controller: systemctl enable slurmctld
$ sudo cp etc/slurmctld.service /etc/systemd/system
$ sudo systemctl enable slurmctld.service
$ sudo systemctl start  slurmctld.service


# Compute Nodes: systemctl enable slurmd
$ sudo cp etc/slurmd.service /etc/systemd/system
$ sudo systemctl enable slurmd.service
$ sudo systemctl start  slurmd.service
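With both daemons running, sinfo should show the debug partition up and the node idle. A hedged sketch of the usual first checks (the node name matches the example config above):

```shell
# Partition and node state as seen by the controller:
sinfo

# Detailed state of a single node:
scontrol show node slurm-master

# Nodes sometimes come up "down" or "drain" after a restart;
# return them to service:
sudo scontrol update NodeName=slurm-master State=RESUME
```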

Verify that Slurm works:

$ srun -l sleep 60 &
$ squeue
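Beyond srun, work is normally submitted as a batch script with sbatch. A minimal sketch (the filename test.sbatch and the job name are arbitrary):

```shell
# Write a one-task batch script; the #SBATCH lines are parsed by sbatch.
cat > test.sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=hello-%j.out
#SBATCH --ntasks=1
srun hostname
EOF
```

Submit it with `sbatch test.sbatch`; `squeue` then shows the job, and its output lands in hello-<jobid>.out (%j expands to the job ID).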

Reference:

https://slurm.schedmd.com/

https://slurm.schedmd.com/download.html

https://github.com/SchedMD/slurm
