Running on the cluster

To run AlvisNLP/ML on the cluster, you will have to be somewhat familiar with Oracle Grid Engine. It is recommended to read migale's introduction.
Information on the qsub and qstat commands is available as man pages on migale.

Submit an AlvisNLP/ML job

On migale:

qsub [QSUB OPTIONS] alvisnlp [ALVISNLP OPTIONS] PLAN_FILE

Yep, it's that simple.

AlvisNLP/ML is automatically submitted with the -V and -cwd options. That means the job will run in the same working directory and with the same environment as when it was submitted.
Additionally, we strongly recommend that you set the -e and -o options, otherwise AlvisNLP/ML's standard output and error will be lost. It is also good practice to set the -m and -M options, so you will be notified of the status of your job.
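
For instance, a submission with all the recommended options might look like this (the log file names and the e-mail address are illustrative, replace them with your own; -m bea requests a mail at job begin, end and abort):

qsub -e myjob.err -o myjob.out -m bea -M your.name@example.org alvisnlp plan.xml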

Important note

You must make sure that all files necessary to run your plan are mounted on the cluster nodes. That includes:
  • /project
  • /home/mig [1]
  • /bibdev
  • /bibprod
That excludes:
  • /home0
  • /projet/alvis
  • /svn
  • system directories: /tmp, /usr, /var
Files and directories you must pay attention to:
  • $JAVA_HOME
  • $ALVISNLP_HOME
  • all useful paths in $CLASSPATH
  • input and output files of the plan

Splitting a corpus

The main advantage of using the cluster is that you can process several parts of a corpus in parallel. If your corpus is so large that a single AlvisNLP/ML run would take too much time, or would not run at all because of memory limits, then the cluster is the solution.
The trick is to split your corpus into several sub-corpora and submit a job for each sub-corpus (see the sketch after the list below). The main challenges are:
  • make each job read a different sub-corpus
  • make each job write on different output files
  • gather the results when all jobs are finished
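
As a starting point, if your corpus is a directory with one file per document, it can be split into fixed-size batches with standard shell tools. This is a minimal sketch; the directory names and the batch size are illustrative:

# split the files in corpus/ into sub-corpora of 1000 documents each
i=0
for f in corpus/*; do
    d=batch_$((i / 1000))    # batch_0, batch_1, ...
    mkdir -p "$d"
    cp "$f" "$d/"
    i=$((i + 1))
done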

Read a different sub-corpus

One way to achieve this would be to write a different plan for each job. The plans would be nearly identical and would differ only in the sourceDir parameter of the initial reader module. This solution is ugly and dangerous: if you make changes, you must make sure you make them in all plans.

The preferred way is to use a single plan and to exploit the -param command line option. This option allows you to override the value of a parameter. Assuming that your plan has a TextFileReader module with the identifier reader, and that your corpus has been split into three directories foo, bar and toto, the following commands will submit a job for each sub-corpus:

qsub alvisnlp -param reader sourceDir foo plan.xml
qsub alvisnlp -param reader sourceDir bar plan.xml
qsub alvisnlp -param reader sourceDir toto plan.xml
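
The same submissions can be scripted, which scales better than typing one command per sub-corpus:

for d in foo bar toto; do
    qsub alvisnlp -param reader sourceDir "$d" plan.xml
done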

Write into different output files

If different jobs write concurrently to the same file, then the result is undefined and you will end up with garbage. To avoid this (see the example after the list):
  • use -param to set distinct output file and directory paths
  • use -log to set distinct log files
  • use -dumpModule to set distinct dump files
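
For instance, assuming your plan ends with a writer module whose identifier is writer and whose output directory parameter is outDir (both names are hypothetical, check your own plan), each job can be given its own output directory and log file:

# 'writer' and 'outDir' are hypothetical; substitute your plan's writer module and parameter
qsub alvisnlp -param reader sourceDir foo -param writer outDir out-foo -log foo.log plan.xml
qsub alvisnlp -param reader sourceDir bar -param writer outDir out-bar -log bar.log plan.xml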

Gathering results

In the best case, the global result is a concatenation of each job's output. This is typically the case for export modules and ExpressionExtract.
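
For instance, if each job wrote its result into its own directory as in the example above (the directory and file names are illustrative):

cat out-foo/result.txt out-bar/result.txt out-toto/result.txt > result.txt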

Sometimes the global result may be easily computed from the individual results. For instance, in order to aggregate several NewCount results, you will have to sum the counts of each entry with a perl/awk/python script.
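
As a sketch, assuming each NewCount output is a tab-separated file with an entry in the first column and its count in the second (check the actual format of your output), the counts can be summed across jobs with awk:

# the file layout (entry<TAB>count) and the file names are assumptions
awk -F'\t' '{ count[$1] += $2 } END { for (e in count) print e "\t" count[e] }' out-*/counts.txt > counts.txt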

In the worst case, the global result cannot be computed from the individual results. This is the case with YateaExtractor and TrainingElementClassifier.

[1] Although /home/mig is mounted on the nodes, it is not recommended to access this file system intensively from the cluster.