Running on the cluster

To run AlvisNLP/ML on the cluster, you will have to be somewhat familiar with Oracle Grid Engine. It is recommended to read migale's introduction.
Information on the qsub and qstat commands is available as man pages on migale.

Submit an AlvisNLP/ML job

On migale:

qsub [QSUB OPTIONS] alvisnlp [ALVISNLP OPTIONS] PLAN_FILE

Yep, it's that simple.

AlvisNLP/ML is automatically submitted with the -V and -cwd options. That means the job will run in the same working directory and with the same environment as when it was submitted.
Additionally, we strongly recommend that you set the -e and -o options, otherwise AlvisNLP/ML's standard output and error will be lost. It is also good practice to set the -m and -M options, so you will be notified of the status of your job.
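
For instance, a submission with all the recommended options might look like this (the log file names and the e-mail address are illustrative, replace them with your own; -m bea requests a mail at job begin, end and abort):

qsub -e myjob.err -o myjob.out -m bea -M your.name@example.org alvisnlp plan.xml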

Important note

You must make sure that all files necessary to run your plan are mounted on the cluster nodes. That includes:
  • /project
  • /home/mig [1]
  • /bibdev
  • /bibprod
That excludes:
  • /home0
  • /projet/alvis
  • /svn
  • system directories: /tmp, /usr, /var
Files and directories you must pay attention to:
  • $JAVA_HOME
  • $ALVISNLP_HOME
  • all useful paths in $CLASSPATH
  • input and output files of the plan

Splitting a corpus

The main advantage of using the cluster is that you can process several parts of a corpus in parallel. If your corpus is so large that a single AlvisNLP/ML run would take too much time, or would not run at all because of memory limits, then the cluster is the solution.
The trick is to split your corpus into several sub-corpora and submit a job for each sub-corpus (see the sketch after the list below). The main challenges are:
  • make each job read a different sub-corpus
  • make each job write on different output files
  • gather the results when all jobs are finished
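
As a starting point, if your corpus is a directory with one file per document, it can be split into fixed-size batches with standard shell tools. This is a minimal sketch; the directory names and the batch size are illustrative:

# split the files in corpus/ into sub-corpora of 1000 documents each
i=0
for f in corpus/*; do
    d=batch_$((i / 1000))    # batch_0, batch_1, ...
    mkdir -p "$d"
    cp "$f" "$d/"
    i=$((i + 1))
done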

Read a different sub-corpus

One way to achieve this would be to write a different plan for each job. The plans would be nearly identical and would differ only in the sourceDir parameter of the initial reader module. This solution is ugly and dangerous: if you make changes, you must make sure you make them in all plans.

The preferred way is to use a single plan and to exploit the -param command line option. This option allows you to override the value of a parameter. Assuming that your plan has a TextFileReader module with the identifier reader, and that your corpus has been split into three directories foo, bar and toto, the following commands will submit a job for each sub-corpus:

qsub alvisnlp -param reader sourceDir foo plan.xml
qsub alvisnlp -param reader sourceDir bar plan.xml
qsub alvisnlp -param reader sourceDir toto plan.xml
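
The same submissions can be scripted, which scales better than typing one command per sub-corpus:

for d in foo bar toto; do
    qsub alvisnlp -param reader sourceDir "$d" plan.xml
done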

Write into different output files

If different jobs write concurrently to the same file, then the result is undefined and you will end up with garbage. To avoid this (see the example after the list):
  • use -param to set distinct output file and directory paths
  • use -log to set distinct log files
  • use -dumpModule to set distinct dump files
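
For instance, assuming your plan ends with a writer module whose identifier is writer and whose output directory parameter is outDir (both names are hypothetical, check your own plan), each job can be given its own output directory and log file:

# 'writer' and 'outDir' are hypothetical; substitute your plan's writer module and parameter
qsub alvisnlp -param reader sourceDir foo -param writer outDir out-foo -log foo.log plan.xml
qsub alvisnlp -param reader sourceDir bar -param writer outDir out-bar -log bar.log plan.xml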

Gathering results

In the best case, the global result is a concatenation of each job's output. This is typically the case for export modules and ExpressionExtract.
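
For instance, if each job wrote its result into its own directory as in the example above (the directory and file names are illustrative):

cat out-foo/result.txt out-bar/result.txt out-toto/result.txt > result.txt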

Sometimes the global result may be easily computed from the individual results. For instance, in order to aggregate several NewCount results, you will have to sum the counts of each entry with a perl/awk/python script.
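
As a sketch, assuming each NewCount output is a tab-separated file with an entry in the first column and its count in the second (check the actual format of your output), the counts can be summed across jobs with awk:

# the file layout (entry<TAB>count) and the file names are assumptions
awk -F'\t' '{ count[$1] += $2 } END { for (e in count) print e "\t" count[e] }' out-*/counts.txt > counts.txt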

In the worst case, the global result cannot be computed from the individual results. This is the case with YateaExtractor and TrainingElementClassifier.

[1] Although /home/mig is mounted on the nodes, it is not recommended to access this file system intensively from the cluster.