Running simulations in batch mode
 
  by Fix Jeremy
 
 

The Intercell cluster can be used to run simulations in batch mode, for example when you have several experiments/binaries that you want the cluster to execute.

Let us call these experiments or binaries jobs. Jean-Louis Gutzwiller has developed a tool which makes it easy to run several such jobs using any number of nodes of the cluster. In this article, we explain how to do this.

Philosophy

The philosophy is to run, on your machine, a server which knows the jobs to be executed. This machine must obviously stay online while the jobs get executed by the cluster. Say you have P jobs.

You then run N nodes on the cluster, which take the jobs in the order in which they are defined. Each node saves whatever its job writes to the standard output in a file specific to this job, then asks the server for another job. This process repeats until the list of jobs is empty.

So we need to explain:
- how to run a supelec-task-server on your machine
- how to run the nodes on the cluster
- how to collect the results of your simulations

At the end, we provide a runnable example.

Installing and running supelec-task-server on Linux

You can download the latest version of supelec-task-server:

supelec-task-server (GZ - 644.1 KB)

To compile supelec-task-server on your Linux machine, as root:

su -    # or: sudo su, depending on the distribution
tar -zxvf supelec-task-server-1.06.tar.gz
cd supelec-task-server-1.06/
./reconf
./configure --prefix=/usr
make
make install

You should then be able to run supelec-task-server and access the documentation:

supelec-task-server --help

It should open a PDF document, ServeurTaches.pdf. This documentation is in French, but we explain the main usage of supelec-task-server in this article anyway.

Usage

The supelec-task-server binary must be executed on a machine that the cluster can reach, and is run as:

supelec-task-server port task_file result_file [--delay t(ms)]

We focus only on the main options. Check the help for others.

supelec-task-server opens 3 consecutive ports starting from port: port, port+1 and port+2.
The task_file is the file containing the list of jobs to be executed. For example, suppose there is a binary called my_program which takes one argument, and that we want to run several instances with different parameters. Let us write the jobs in the file toto.job.
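
An illustrative content for toto.job (a sketch, assuming my_program sits in the working directory of the nodes and takes a single numeric parameter):

./my_program 1
./my_program 2
./my_program 3
./my_program 4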

Each line is the raw command that the nodes on the cluster will receive and execute.

The result_file will contain the output that your simulations emit on the standard output. Warning: the result file is filled as the jobs get completed, so there is no guarantee that the results will appear in the same order as the jobs. But you can easily prepend a prefix to your results identifying the tested condition. One thing is guaranteed though: the outputs of different jobs do not get mixed in the result file, so the output of each job remains contiguous.

The --delay t option enforces a delay (in milliseconds) between two consecutive job distributions when multiple nodes ask for a job at the same time. This is particularly relevant if your jobs generate random numbers with a seed set to the current time: the delay helps ensure that no two jobs use the same seed.
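
Putting these options together, a typical invocation (reusing the hypothetical toto.job above, serving on port 5000 with a result file results.txt) could be:

supelec-task-server 5000 toto.job results.txt --delay 1000

The nodes would then contact this machine on ports 5000, 5001 and 5002.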

Another way of collecting the results of your simulations

As mentioned previously, you specify a result file which collects what your jobs write to the standard output (and, if you also pass an error file to supelec-task-server, the standard error is collected there).

But I prefer another way of collecting the results, which is to set up your binaries so that they accept an id for the job. Each job can then dump its results into a file named after this unique id.
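
For instance, with a hypothetical binary that takes the job id as its first argument and derives its output file name from it, the task file could look like:

./my_program 1 0.1
./my_program 2 0.5
./my_program 3 0.9

where job i would write its results to a file such as result_i.dat, independently of the result_file managed by supelec-task-server.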

Running the nodes on the cluster

Now that we have a supelec-task-server running and ready to distribute its jobs, we need to run the nodes on the cluster. The nodes are launched from a frontal machine (e.g. vera.ic). For this, we propose to use the oarsub_tasks bash script, which you should place in a directory of your PATH on the frontal machine.

oarsub_tasks (Zip - 720 bytes)
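
A possible way to put it in place (a sketch, assuming the downloaded archive is named oarsub_tasks.zip, that your login is mylogin and that ~/bin is in your PATH on the frontal machine):

unzip oarsub_tasks.zip
ssh mylogin@vera.ic mkdir -p bin
scp oarsub_tasks mylogin@vera.ic:bin/
ssh mylogin@vera.ic chmod +x bin/oarsub_tasks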

Suppose that supelec-task-server is running on the host toto.machine. You can then launch the nodes from a frontal node:

ssh mylogin@vera.ic
for (( i = 0 ; i <= 100 ; i++)) ; do
     oarsub_tasks toto.machine
done

To check the status of your jobs, from a frontal node (e.g. vera.ic):

oarstat

To kill all the jobs you are currently running, from a frontal node (e.g. vera.ic):

oarstat | grep `whoami` | awk ' { print $1 } ' | xargs oardel

An example with a Matlab job

I propose to run a Matlab script several times. The Matlab script is the following:

Matlab script (Zip - 398 bytes)

Copy this script to the home directory of your account on the cluster.

We now generate a task file. I propose to use the following Python script to generate it.

generate_task.py (Zip - 376 bytes)

On your local machine, run this script:

python generate_task.py 100

This produces a task list task.jobs with 100 calls to myfunction.
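
Each line of task.jobs is again a raw command that the nodes will execute. As a rough sketch (the actual lines are produced by the downloaded generate_task.py and may differ), they could look like:

matlab -nodisplay -nosplash -r "myfunction(1); exit"
matlab -nodisplay -nosplash -r "myfunction(2); exit"
...
matlab -nodisplay -nosplash -r "myfunction(100); exit"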

Now, on your local machine, start the supelec-task-server:

user@local:~$ supelec-task-server 5000 task.jobs results.jobs errors.jobs --delay 1000

You can now log onto a frontal machine, create the directory that will contain the results and run the nodes.

user@local:~$ ssh vera.ic
user@vera.ic:~$ mkdir Results
user@vera.ic:~$ for (( i = 0 ; i <= 50 ; i++)) ; do oarsub_tasks local.metz.supelec.fr; done
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=707020
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=707021
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=707022
[ADMISSION RULE] Modify resource description with type constraints
OAR_JOB_ID=707023
...........

You should see a list of jobs running on the cluster by calling oarstat on vera.ic.

You can also see the result files being created in Results. I also suggest checking out the content of the results.jobs and errors.jobs files.

When all the jobs have been completed, you can kill the supelec-task-server.