| Description | Installation of Slurm on CentOS 7 |
| --- | --- |
| Related-course materials | HPC Administration Module 2 |
| Authors | Ndomassi TANDO (ndomassi.tando@ird.fr) |
| Creation Date | 23/09/2019 |
| Last Modified Date | 23/09/2019 |
Summary
Definition
Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters.
https://slurm.schedmd.com/
Authentication and databases:
Create the users for munge and slurm:
Slurm and Munge require a consistent UID and GID across every node in the cluster. On every node, before installing Slurm or Munge, create the users:
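For example (a sketch; the UID/GID values are placeholders, choose IDs that are free on all your nodes):

```bash
# Create the munge and slurm users with identical UID/GID on every node
export MUNGEUSER=991
groupadd -g $MUNGEUSER munge
useradd -m -c "MUNGE authentication service" -d /var/lib/munge -u $MUNGEUSER -g munge -s /sbin/nologin munge
export SLURMUSER=992
groupadd -g $SLURMUSER slurm
useradd -m -c "SLURM workload manager" -d /var/lib/slurm -u $SLURMUSER -g slurm -s /bin/bash slurm
```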
Munge Installation for authentication:
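On CentOS 7, munge is provided by the EPEL repository; for example:

```bash
yum install -y epel-release
yum install -y munge munge-libs munge-devel
```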
Create a munge authentication key:
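For example, on the master node:

```bash
/usr/sbin/create-munge-key -r
```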
Copy the munge authentication key on every node:
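For example (node names are placeholders):

```bash
scp /etc/munge/munge.key root@node0:/etc/munge/
scp /etc/munge/munge.key root@node1:/etc/munge/
```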
Set the rights:
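On every node, something like:

```bash
chown -R munge: /etc/munge/ /var/log/munge/
chmod 0700 /etc/munge/ /var/log/munge/
```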
Enable and Start the munge service with:
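On every node:

```bash
systemctl enable munge
systemctl start munge
```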
Test munge from the master node:
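For example (nodeX is a placeholder for one of your compute nodes):

```bash
munge -n | unmunge              # check the local munge daemon
munge -n | ssh nodeX unmunge    # check that a remote node can decode the credential
```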
Mariadb installation and configuration
Install mariadb with the following command:
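```bash
yum install -y mariadb-server mariadb-devel
```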
Activate and start the mariadb service:
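```bash
systemctl enable mariadb
systemctl start mariadb
```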
Secure the installation:
Launch the following command to set the root password and secure MariaDB:
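```bash
mysql_secure_installation
```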
Modify the innodb configuration:
Setting innodb_lock_wait_timeout, innodb_log_file_size and innodb_buffer_pool_size to larger values than the default is recommended.
To do that, create the file /etc/my.cnf.d/innodb.cnf
with the following lines:
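A sketch with indicative values (adjust innodb_buffer_pool_size to your server's memory):

```
[mysqld]
innodb_buffer_pool_size=1024M
innodb_log_file_size=64M
innodb_lock_wait_timeout=900
```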
To implement this change you have to shut down the database and move/remove logfiles:
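For example:

```bash
systemctl stop mariadb
mv /var/lib/mysql/ib_logfile? /tmp/
systemctl start mariadb
```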
Slurm installation:
Install the following prerequisites:
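For example (the exact package list may vary with the Slurm version):

```bash
yum install -y gcc make rpm-build openssl openssl-devel pam-devel \
    numactl numactl-devel hwloc hwloc-devel lua lua-devel readline-devel \
    rrdtool-devel ncurses-devel gtk2-devel man2html \
    libibmad libibumad perl-ExtUtils-MakeMaker munge-devel mariadb-devel
```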
Retrieve the tarball
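For example (the version is a placeholder; pick the release you want from https://download.schedmd.com/slurm/):

```bash
cd /root
wget https://download.schedmd.com/slurm/slurm-19.05.2.tar.bz2
```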
Create the RPMs:
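Using the tarball downloaded above (same placeholder version):

```bash
rpmbuild -ta slurm-19.05.2.tar.bz2
```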
RPMs are located in /root/rpmbuild/RPMS/x86_64/
Install slurm on master and nodes
In the RPMs' folder, launch the following command:
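For example:

```bash
cd /root/rpmbuild/RPMS/x86_64/
yum --nogpgcheck localinstall slurm-*.rpm
```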
Create and configure the slurm_acct_db database:
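A sketch (the password is a placeholder and must match the StoragePass value used in slurmdbd.conf below):

```bash
mysql -u root -p <<'EOF'
grant all on slurm_acct_db.* to 'slurm'@'localhost' identified by 'some_pass' with grant option;
create database slurm_acct_db;
EOF
```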
Configure the slurm db backend:
Modify the /etc/slurm/slurmdbd.conf
with the following parameters:
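A minimal sketch (hostnames and the password are placeholders):

```
AuthType=auth/munge
DbdAddr=localhost
DbdHost=localhost
SlurmUser=slurm
DebugLevel=verbose
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurmdbd.pid
StorageType=accounting_storage/mysql
StorageHost=localhost
StoragePass=some_pass
StorageUser=slurm
StorageLoc=slurm_acct_db
```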
Then enable and start the slurmdbd service:
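```bash
systemctl enable slurmdbd
systemctl start slurmdbd
```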
This will populate the slurm_acct_db database with tables.
Configuration file /etc/slurm/slurm.conf:
Use the lscpu command on each node to get processor information.
Visit http://slurm.schedmd.com/configurator.easy.html to make a configuration file for Slurm.
Modify the following parameters in /etc/slurm/slurm.conf
to match your cluster:
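A minimal sketch (the cluster name, hostnames, node names and hardware values are examples; use the figures reported by lscpu on your nodes):

```
ClusterName=mycluster
ControlMachine=master
SlurmUser=slurm
AuthType=auth/munge
StateSaveLocation=/var/spool/slurmctld
SlurmdSpoolDir=/var/spool/slurmd
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=master
JobAcctGatherType=jobacct_gather/linux
NodeName=node[0-1] CPUs=24 Sockets=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=64000 State=UNKNOWN
PartitionName=normal Nodes=node[0-1] Default=YES MaxTime=INFINITE State=UP
```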
Now that the server node has slurm.conf and slurmdbd.conf correctly filled in, we need to send these files to the other compute nodes.
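For example (node names are placeholders):

```bash
scp /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf root@node0:/etc/slurm/
scp /etc/slurm/slurm.conf /etc/slurm/slurmdbd.conf root@node1:/etc/slurm/
```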
Create the folders to host the logs
On the master node:
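A sketch (the paths must match StateSaveLocation and SlurmctldLogFile in slurm.conf):

```bash
mkdir -p /var/spool/slurmctld /var/log/slurm
chown slurm:slurm /var/spool/slurmctld
chmod 755 /var/spool/slurmctld
touch /var/log/slurm/slurmctld.log
chown slurm:slurm /var/log/slurm/slurmctld.log
```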
On the compute nodes:
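A sketch (the paths must match SlurmdSpoolDir and SlurmdLogFile in slurm.conf):

```bash
mkdir -p /var/spool/slurmd /var/log/slurm
chown slurm:slurm /var/spool/slurmd
chmod 755 /var/spool/slurmd
touch /var/log/slurm/slurmd.log
chown slurm:slurm /var/log/slurm/slurmd.log
```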
Test the configuration:
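For example, run slurmd -C on a compute node; it prints the node's hardware in slurm.conf syntax:

```bash
slurmd -C
```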
You should get something like:
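(illustrative output only; the values depend on your hardware)

```
NodeName=node0 CPUs=24 Boards=1 SocketsPerBoard=2 CoresPerSocket=12 ThreadsPerCore=1 RealMemory=64216
UpTime=5-07:23:46
```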
Launch the slurmd service on the compute nodes:
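```bash
systemctl enable slurmd
systemctl start slurmd
```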
Launch the slurmctld service on the master node:
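```bash
systemctl enable slurmctld
systemctl start slurmctld
```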
Change the state of a node from down to idle
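For example:

```bash
scontrol update NodeName=nodeX State=idle
```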
Where nodeX is the name of your node
Configure usage limits
In the /etc/slurm/slurm.conf file, modify the AccountingStorageEnforce parameter with:
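For example, one possible combination (see the Slurm accounting documentation for the other flags):

```
AccountingStorageEnforce=limits,qos
```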
Copy the modified file to all the nodes:
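For example (node names are placeholders):

```bash
scp /etc/slurm/slurm.conf root@node0:/etc/slurm/
```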
Restart the slurmctld service to validate the modifications:
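```bash
systemctl restart slurmctld
```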
Create a cluster:
The cluster is the name we choose for our Slurm cluster.
It is defined in the /etc/slurm/slurm.conf file with the line:
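For example (mycluster is a placeholder for the name you choose):

```
ClusterName=mycluster
```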
To set usage limitations for your users, you first have to create an accounting cluster with the command:
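For example, with the placeholder name used above:

```bash
sacctmgr add cluster mycluster
```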
Create an accounting account
An accounting account is a group under Slurm that allows the administrator to manage users' rights to use Slurm.
Example: you can create an account to group the bioinfo team's members.
You can create an account to group the people allowed to use the gpu partition.
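For example (account names, descriptions and organization are placeholders):

```bash
sacctmgr add account bioinfo Description="bioinfo team" Organization=lab
sacctmgr add account gpu_group Description="users allowed on the gpu partition" Organization=lab
```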
Create a user account
You have to create Slurm users to allow them to launch Slurm jobs.
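For example (john and bioinfo are placeholders):

```bash
sacctmgr create user name=john account=bioinfo
```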
Modify a user account to add it to another accounting account:
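For example, to add the same user to the gpu_group account created above:

```bash
sacctmgr add user john account=gpu_group
```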
Modify a node definition
Add the size of the /scratch partition
In the file /etc/slurm/slurm.conf
Modify the TmpFS file system
Add the TmpDisk value for /scratch
TmpDisk is the size of the scratch space in MB; you have to add it in the line starting with NodeName.
For example, for a node with a 3 TB disk:
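A sketch (node name and hardware values are examples; 3 TB is roughly 3000000 MB):

```
TmpFS=/scratch
NodeName=node0 CPUs=24 RealMemory=64000 TmpDisk=3000000 State=UNKNOWN
```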
Modify a partition definition
You have to modify the line starting with PartitionName in the file /etc/slurm/slurm.conf.
Several options are available according to what you want:
Add a time limit for running jobs (MaxTime)
A time limit on partitions allows Slurm to manage priorities between jobs on the same node.
You have to add it in the PartitionName line with the amount of time in minutes.
For example, for a partition with a 1-day max time, the partition definition will be:
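For example (partition and node names are placeholders; 1 day = 1440 minutes):

```
PartitionName=normal Nodes=node[0-1] Default=YES MaxTime=1440 State=UP
```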
Add a Max Memory per CPU (MaxMemPerCPU)
As memory is a consumable resource, MaxMemPerCPU serves not only to protect the node's memory but will also automatically increase a job's core count on submission where possible.
You have to add it in the PartitionName line with the amount of memory in MB.
This is normally set to MaxMem/NumCores; for example, with 2 GB per CPU, the partition definition will be:
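For example (2 GB per CPU = 2048 MB; names are placeholders):

```
PartitionName=normal Nodes=node[0-1] Default=YES MaxTime=1440 MaxMemPerCPU=2048 State=UP
```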
Links
- Related courses : HPC Trainings