Batch jobs


Maui/Torque

Section contributed by IPB

Maui Cluster Scheduler is an open-source job scheduler for clusters and supercomputers. It works coupled with a resource manager, such as TORQUE, LoadLeveler, or LSF. The MAUI scheduler tells the resource manager what to run, when, and where, in order to utilize compute resources fairly and effectively and to maximize the throughput of the cluster.

The PARADOX cluster at the Institute of Physics Belgrade uses version 3.2.6p21 of the MAUI scheduler in conjunction with version 2.3.6-2 of the TORQUE resource manager. The usage of the MAUI/TORQUE batch system on the PARADOX cluster is described below.

From the PARADOX gateway machine (ui.ipb.ac.rs), jobs can be submitted using the qsub command. The batch system server node (ce64.ipb.ac.rs) accepts submitted jobs and distributes them to the PARADOX cluster worker nodes for execution. To submit a batch job, you have to prepare a shell script containing a set of batch directives that describe the resources needed by the job, followed by the lines necessary to execute your code. After submission, the job enters a batch queue, and when the requested resources become available, the job is launched over the allocated nodes. The batch system provides monitoring of all submitted jobs. Within the HP-SEE project, the hpsee queue is available for HP-SEE user jobs.
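
A typical session could look like the following sketch, assuming the job script from the serial example below has been saved as job.pbs (a hypothetical file name):

 # submit the script; qsub prints the assigned job identifier
 qsub job.pbs
 # check the status of your own jobs
 qstat -u $USER
 # remove a job from the queue if it is no longer needed
 qdel <jobID>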

Frequently used PBS commands for getting the status of the system, queues, or jobs are illustrated in the table below.

Command Description
qstat list information about queues and jobs
qstat -q list all queues on system
qstat -Q list queue limits for all queues
qstat -a list all jobs on system
qstat -au userID list all jobs owned by user userID
qstat -s list all jobs with status comments
qstat -r list all running jobs
qstat -f jobID list all information known about specified job
qstat -n in addition to the basic information, nodes allocated to a job are listed
qstat -Qf <queue> list all information about specified queue
qstat -B list summary information about the PBS server
qdel jobID delete the batch job with jobID
qalter alter a batch job
qsub submit a job

The following script can be used for serial job submission on the PARADOX cluster:

 #!/bin/bash
 #PBS -q hpsee
 #PBS -l nodes=1:ppn=1
 #PBS -l walltime=00:10:00
 #PBS -e ${PBS_JOBID}.err
 #PBS -o ${PBS_JOBID}.out
 
 cd $PBS_O_WORKDIR
 chmod +x job.sh
 ./job.sh

The line #!/bin/bash specifies which shell will be used by the job, #PBS -q <queue> directs the job to the specified queue, and #PBS -o/-e <name> redirects standard output/standard error to the file <name> (in this case ${PBS_JOBID}.out and ${PBS_JOBID}.err). $PBS_JOBID is an environment variable created by PBS that contains the PBS job identifier. The line #PBS -l walltime=<time> defines the maximum wall-clock time, while cd $PBS_O_WORKDIR changes the current directory to the initial working directory from which the job was submitted.

The PARADOX cluster has several MPI implementations installed. All of them reside in /opt/<MPI_VERSION> directories, and each has its own environment variables:

  • mpich-1.2.7p1
    MPI_MPICH_MPIEXEC=/opt/mpiexec-0.83/bin/mpiexec
    MPI_MPICH_PATH=/opt/mpich-1.2.7p1
  • mpich2-1.1.1p1
    MPI_MPICH2_MPIEXEC=/opt/mpiexec-0.83/bin/mpiexec
    MPI_MPICH2_PATH=/opt/mpich2-1.1.1p1
  • openmpi-1.2.5
    MPI_OPENMPI_MPIEXEC=/opt/openmpi-1.2.5/bin/mpiexec
    MPI_OPENMPI_PATH=/opt/openmpi-1.2.5

MPI_<MPI_VERSION>_MPIEXEC defines the MPI launcher for the specific MPI implementation:

  • MPICH and MPICH2 use mpiexec-0.83 (MPI parallel job launcher for PBS)
  • OpenMPI has its own version of mpiexec

Here is a sample MPI job PBS script:

 #!/bin/bash
 #PBS -q hpsee
 #PBS -l nodes=3:ppn=8
 #PBS -l walltime=00:10:00
 #PBS -e ${PBS_JOBID}.err
 #PBS -o ${PBS_JOBID}.out
 
 cd $PBS_O_WORKDIR
 chmod +x job
 cat $PBS_NODEFILE
 ${MPI_MPICH_MPIEXEC} ./job                 # If mpich-1.2.7p1 is used
 ${MPI_MPICH2_MPIEXEC} --comm=pmi ./job     # If mpich2-1.1.1p1 is used
 ${MPI_OPENMPI_MPIEXEC} ./job               # If openmpi-1.2.5 is used

Depending on the MPI implementation used, the appropriate environment variable (MPI_<MPI_VERSION>_MPIEXEC) should be used to launch MPI jobs. The MPI launcher, together with the batch system, takes care of properly launching the parallel job, i.e. there is no need to specify the number of MPI processes or a machine file on the command line; the launcher obtains all this information from the batch system.
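
For comparison, a generic manual MPI launch outside the batch system would have to state the process count and the host list explicitly. The sketch below is only an illustration (flag names vary between MPI implementations) and is not part of the PARADOX setup:

 # manual launch (illustrative only): process count and machine file given by hand
 mpiexec -np 24 -machinefile nodes.txt ./job
 # under PBS the integrated launcher reads both from the batch system
 ${MPI_MPICH_MPIEXEC} ./job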

Condor-HTC

Section contributed by UVT

Condor-HTC is a job scheduler intended for heterogeneous systems, as it has the ability to connect different types of architectures in one administrative environment. Condor-HTC uses a job description file in which the user specifies job parameters such as the binary code to be executed, input data, output data and distributed-environment-specific parameters such as the number of CPUs, job type, standard streams, etc.

A normal workflow for submitting a job through Condor-HTC is the following:

1. Compile the application source code for the desired execution environment architecture (for example x86_64, i386, etc.);

2. Prepare the input data and parameters for the simulation;

3. Check the available resources in the Condor pool:

 condor_status

4. Create a job description file for the application (the following use case is for the FuzzyCmeans application developed in the frame of the project):

 job_desc = "job_name"
 universe = parallel
 executable = mpi_wapper_for_fuzzycmeans
 initialdir = current_dir
 arguments = job_args
 machine_count = num_cpus
 output = outfile.$(NODE)
 error = errfile.$(NODE)
 log = condor.$(Process).log
 WantIOProxy = false
 ImageSize = 1
 Requirements = (Machine != "head.infragrid") && ( Memory > 1024 )
 queue

a. job_desc – job label;

b. universe – type of the job (parallel, vanilla, standard etc.);

c. executable – path to the binary code, or only its name if initialdir is defined; in case of parallel jobs a wrapper must be defined in order to prepare the environment for the MPI task;

d. arguments – application parameters;

e. machine_count – number of compute CPUs/cores;

f. output/error/log – standard output, error and default condor process log file;

g. WantIOProxy – useful if you don't have a shared file system and the input and output files must be transferred to and from the worker nodes;

h. Requirements – special requirements such as architecture, memory, node list, etc.

5. Submit the above job to the scheduler using:

 condor_submit description_file

6. Check the status of the submitted job:

 condor_q

7. When the job has finished, you can check the output files for the results.
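
For a simple serial job, a minimal submit file can be much shorter; the sketch below uses the vanilla universe and hypothetical file names:

 universe   = vanilla
 executable = my_program
 arguments  = input.dat
 output     = out.$(Process)
 error      = err.$(Process)
 log        = condor.log
 queue

Such a job is submitted and monitored with the same condor_submit and condor_q commands as above.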

Sun Grid Engine

Section contributed by NIIFI

SGE is typically used on computer farms, or high-performance computing clusters, and is responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs. It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. Several commands help users manage the scheduler system; the most useful ones are collected here.

Resource status commands:

  • cluster queue summary: qstat -g c
  • resource availability per host: qhost -F
  • show the parallel environments: qconf -spl
  • show the available queues: qconf -sql

Job management commands:

  • query the job status of all jobs: qstat -u \*
  • submit a job: qsub script.sh
  • job status query: qstat -j JOB_ID
  • query the reason why our job has not been scheduled yet: qalter -w v JOB_ID

Example MPI submit script:

 #!/bin/bash
 #$ -N CONNECTIVITY
 #$ -pe mpi 10
 mpirun -np $NSLOTS ./connectivity -v

The NSLOTS variable contains the number of slots requested for the parallel environment (10 in this example). The following table shows the most common qsub parameters.

qsub parameter Example Meaning
-N name -N test The job name
-cwd -cwd The output and the error files will be created in the current directory
-S shell -S /bin/bash The shell type
-j {y|n} -j y Concatenate the error and output files
-r {y|n} -r y Whether the job should be restarted after a system restart
-M mail -M something@example.org Job state information will be sent to this e-mail address
-m {b|e|a|s|n} -m bea The requested job states will be reported to the e-mail address
-l -l h_cpu=0:15:0 Wall time limit for the job
-l -l h_vmem=1G Memory limit for the job
-pe -pe mpi 10 Site-specific parameter which sets up the requested parallel environment
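
These parameters can be combined on the command line; for example, the following sketch (script.sh is a hypothetical job script) requests the mpi parallel environment with 10 slots, a 15-minute CPU time limit and 1 GB of memory, and sends mail notifications:

 qsub -N test -cwd -j y -m bea -M something@example.org \
      -l h_cpu=0:15:0 -l h_vmem=1G -pe mpi 10 script.sh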

It is also possible to submit array jobs to SGE. For example, you may have 1000 data sets, and you want to run a single program on each of them using the cluster. Here is an example of submitting an array job:

 qsub -t 1-1000 array.sh

array.sh:

 #!/bin/sh
 #$ -N PI_ARRAY_TEST
 ./pi `expr $SGE_TASK_ID \* 100000`

SGE_TASK_ID is an internal variable which is set to a different task index (here 1 to 1000) in each instance of the job.
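
A common use of SGE_TASK_ID is to let each task pick its own input and output file; the following sketch (program and file names are hypothetical) illustrates the pattern:

 #!/bin/sh
 #$ -N ARRAY_FILES_TEST
 #$ -t 1-1000
 # each task reads input.<task id> and writes result.<task id>
 ./my_program input.$SGE_TASK_ID > result.$SGE_TASK_ID

Since the -t range is given in the script itself, this can be submitted with a plain qsub and runs 1000 independent tasks, each on its own data set.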

IBM Loadleveler @ BG/P - BG

Section contributed by IICT-BAS

Jobs on the BlueGene/P are scheduled for execution by a system called LoadLeveler. A prepared job is submitted for execution via the command llsubmit, which takes a Job Control File (JCF) describing the program to be executed and its environment. This puts the job into a queue of waiting jobs. Scheduling strategies determine how the jobs are prioritized; when a suitable resource becomes available, the next job in the queue is executed. You can see the contents of the queue with llq. If the status of the job shows that there is a problem, you need to look at the error output and remove the job via llcancel.

Example of a JCF, named hello.jcf:

# @ job_name = hello
# @ comment = "This is a HelloWorld program"
# @ error = $(jobid).err
# @ output = $(jobid).out
# @ environment = COPY_ALL;
# @ wall_clock_limit = 01:00:00
# @ notification = never
# @ job_type = bluegene
# @ bg_size = 128
# @ class = n0128
# @ queue
/bgsys/drivers/ppcfloor/bin/mpirun -exe hello -verbose 1 -mode VN -np 512

To send it for execution:

llsubmit hello.jcf
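
After submission, the job can be monitored and, if necessary, removed; the job identifier below stands for whatever identifier llsubmit and llq report:

 llq                  # list queued and running jobs
 llq -l <job_id>      # detailed information about a specific job
 llcancel <job_id>    # remove the job from the queue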

Some parameters that can be passed to the mpirun command in the JCF (combined in the example after this list):

-exe <executable_file> - the executable file itself;
-args "<arguments>" - arguments passed to the executable file;
-verbose 1 - write information about job startup/finalization to the stderr file;
-mode VN|SMP|DUAL - sets the mode of execution;
-np N - the number of processes on which the job will execute;
-env BG_MAXALIGNEXP=-1 - very important parameter, which instructs the CN kernel to ignore alignment traps.
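
For instance, a JCF mpirun line combining these options could look like the following sketch (the executable name and arguments are illustrative only):

 /bgsys/drivers/ppcfloor/bin/mpirun -exe hello -args "input.dat 42" -mode DUAL -np 256 -env BG_MAXALIGNEXP=-1 -verbose 1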

There are several important parameters that can be used in the JCF file:

Parameter Meaning
# @ job_name The name for the job, could be anything.
# @ comment Some comment.
# @ error Where to send the stderr. Writing to file descriptor 2 (in C) writes to this file.
# @ output Same here, but for the stdout (file descriptor 1 in C).
# @ environment This instructs LoadLeveler to copy the whole user environment when running the job. Thus, the job has the same environment as the user that executes llsubmit.
# @ wall_clock_limit Time limit; after this time, the job is cancelled automatically. This cannot be more than a certain time limit imposed by the class of the job.
# @ notification There is no infrastructure for notifications, so 'never' is a good value for this parameter.
# @ job_type This MUST be bluegene.
# @ bg_size This must be an integer, divisible by 128, but not larger than 2048. This gives the number of compute nodes that will be used to execute the job. It must correspond to the class of the job.
# @ class This is the class of the job. The most important parameter. Different classes have different priorities.
# @ queue This instructs LoadLeveler to put the job in the queue.

IBM Loadleveler @ BG/P - UVT

Section contributed by UVT

IBM LoadLeveler is the main job scheduler shipped with any BlueGene/P supercomputer. The usage of IBM LoadLeveler is straightforward and similar to most of the best-known job schedulers. IBM LoadLeveler uses the notion of jobs, and each job can be submitted for execution based on a job descriptor.

Using the job descriptor, a user specifies job parameters such as the binary code to be executed, input data, output data and BG/P environment specific parameters such as the number of CPUs/cores, partition size, execution mode, standard streams, etc.

A normal workflow for submitting a BG/P job using IBM LoadLeveler has the following steps:

1. Prepare and compile the source code and all the dependencies for your application simulation;

2. Check if the BG/P machine is online

 llstatus
 Name                      Schedd InQ  Act Startd Run LdAvg Idle Arch      OpSys 
 fe-bg.hpc.uvt.ro          Avail     0   0 Idle     0 0.00  9999 PPC64     Linux2
 sn-bg.hpc.uvt.ro          Avail     0   0 Idle     0 0.00  9999 PPC64     Linux2
 PPC64/Linux2                2 machines      0  jobs      0  running tasks
 Total Machines              2 machines      0  jobs      0  running tasks
 The Central Manager is defined on sn-bg.hpc.uvt.ro
 The BACKFILL scheduler with Blue Gene support is in use
 Blue Gene is present
 All machines on the machine_list are present.

3. Depending on your job run time, select the appropriate run class:

 llclass

4. Create a LoadLeveler job description file to describe your simulation:

 #!/bin/sh
 # @ job_name = sfcmGridFragm_2iunie
 # @ job_type = bluegene
 # @ requirements = (Machine == "$(host)")
 # @ error = $(job_name)_$(jobid).err
 # @ output = $(job_name)_$(jobid).out
 # @ environment = COPY_ALL;
 # @ notification = always
 # @ notify_user = silviu@info.uvt.ro
 # @ wall_clock_limit = 3:59:00
 # @ class = parallel
 # @ bg_size = 16
 # @ queue
 /bgsys/drivers/ppcfloor/bin/mpirun -mode VN \
     -cwd "/GPFS/users/staff/panica/projects/fuzzy/run/03.08.2011-11.12-5" \
     -args "config_sfcm.cnf 4 4" \
     -exe /GPFS/users/staff/panica/projects/fuzzy/run/03.08.2011-11.12-5/sfcmGridFragm_2iunie

• These options are specific to the FuzzyCmeans application simulations; for a more detailed description of the job file parameters, please consult the IBM LoadLeveler handbook.

5. Submit the job to the BG/P machine:

 llsubmit job_description

6. Check job status (simple):

 llq

7. Check job status (complex):

 llstatus -l

8. Cancel the job:

 llcancel job_id

9. When the job stops running, the output of the execution (including the log files) is available in the directory defined by '-cwd'.