Using AlvisNLP/ML on Migale

AlvisNLP/ML installed on bibdev

There are two versions installed on bibdev:

/bibdev/install/alvisnlp/latest/bin/alvisnlp

The latest numbered version of AlvisNLP/ML. It passes regression tests and all modules are fully documented. This version is quite old and may not include recent features.
To set the PATH in order to use this version, run /bibdev/install/alvisnlp/latest/environ.sh.

Note: the current latest version is really old. Don't use it.

/bibdev/install/alvisnlp/devel/bin/alvisnlp

The latest revision of AlvisNLP/ML that passes regression tests. Not all modules are necessarily documented.
To set the PATH in order to use this version, run /bibdev/install/alvisnlp/devel/environ.sh.
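
For instance, a minimal sketch (assuming environ.sh is intended to be sourced so that it can modify the PATH of the current shell):

# source the environment script, then check which alvisnlp binary is now found
source /bibdev/install/alvisnlp/devel/environ.sh
which alvisnlp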

On migale

Connect as user textemig. AlvisNLP/ML is installed in:

/projet/maiage/work/textemig/software/install/alvisnlp

A symlink in textemig's home provides a different path to the same directory:

/projet/maiage/save/textemig/projet-work/software/install/alvisnlp

Submitting AlvisNLP/ML jobs on the Grid Engine

The scenario assumed here is the processing of a large corpus with a single AlvisNLP/ML plan.

Find a shared disk space

All resources must be reachable from migale and the cluster nodes (named nXX). This includes:

  • the AlvisNLP/ML install directory
  • all external executables used with the plan (treetagger, yatea, etc.), usually in /projet/maiage/work/textemig/software/
  • the corpus files
  • all resources used by the modules (lexicons, classifiers, etc.), usually in /projet/maiage/work/textemig/resources/
  • directories where output will be written

The following directories are mounted on migale and all nodes:

  • /projet/mig/work: it is recommended to place as much data as possible in this directory (a quick reachability check is sketched below)
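
To check that a given path is actually visible from a compute node, a quick sketch (assuming interactive submission through qrsh is allowed on the cluster):

# run ls on a node through the scheduler; an error means the path is not mounted there
qrsh -q short.q ls -d /projet/maiage/work/textemig/resources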

Split the corpus

Split the corpus into chunks. There is no general rule for chunk size: you will usually have to balance many small chunks against few large chunks.

Advantages of smaller chunks:

  • more parallelism
  • easier re-run if the processing of some chunks failed
  • in some cases, smaller chunks are less likely to fail (fewer out-of-memory errors, for instance)

Advantages of larger chunks:

  • parallelism is limited by the number of slots in the Grid Engine anyway
  • less time spent in warm-up (resource loading)

Tools for splitting

  • split is a standard Unix command that splits text files into chunks with an equal number of lines (see the sketch below)
  • /projet/maiage/work/textemig/software/pubmed-utils/split-pubmed.py is a script that splits PubMed XML files
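
For instance, a minimal sketch with split, assuming a line-oriented corpus file corpus.txt (the file name, chunk size and directory layout are only examples):

# split the corpus into chunks of 10000 lines each, named chunk_aa, chunk_ab, ...
split -l 10000 corpus.txt chunk_
# create one directory per chunk and move each piece inside as input.txt,
# matching the single-entity layout recommended in the next section
i=1
for f in chunk_*; do
    mkdir -p "chunk$i"
    mv "$f" "chunk$i/input.txt"
    i=$((i+1))
done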

Prepare a cluster-ready plan

The same plan will be used for all chunks. However, some parameters will be specific to each chunk:

  • parameters that specify the source in the reader module (obviously)
  • parameters that specify the output directory or file for modules that write output (TabularExport, AlvisIRIndexer, etc.)

In the variable parts of the parameters, one can use a custom XML entity:

<module id="reader" class="TexFileReader">
  <sourcePath>&inputfile;</sourcePath>
</module>

...

<module id="export" class="TabularExport">
  <outDir>output</outDir>
  <fileName>"&outputfile;"</fileName>
</module>

The alvisnlp command line allows you to specify the value of XML entities.

A different entity could be used for each variable part; however, we recommend creating a directory for each chunk and placing the input and output files in the corresponding chunk directory. In this way a single custom entity is required:

<module id="reader" class="TexFileReader">
  <sourcePath>&chunk;/input.txt</sourcePath>
</module>

...

<module id="export" class="TabularExport">
  <outDir>&chunk;</outDir>
  <fileName>"output.txt"</fileName>
</module>

Submission command-line

qsub [qsub options] alvisnlp [alvisnlp options]

Recommended qsub options

-j yes and -o /dev/null

-j yes merges the standard output and standard error, -o /dev/null discards them both.
Otherwise the scheduler creates one file per stream for each submission; these files are cumbersome and useless with alvisnlp.

-q short.q

Submit to the short-jobs queue. Jobs in this queue are killed after 4 hours of run time. In the unlikely case your jobs need more time, submit to long.q.
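
To check the actual run-time limit of a queue, a quick sketch (assuming the standard Grid Engine client tools are available on migale):

# show the queue configuration and look for the soft and hard run-time limits (s_rt, h_rt)
qconf -sq short.q | grep _rt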

-cwd

Run the job using the same working directory as when the job was submitted.
If this option is not specified then the job's current directory is your home directory, and all your relative paths will be broken.

-V

Exports your environment variables to the job's environment.
If this option is not specified then the job's environment contains only $HOME (not even $PATH).

-N XXXX

Sets the name of the job.
If this option is not specified then the job name is "alvisnlp", which is not very practical for monitoring lots of jobs.
We recommend a name based on the chunk number.

Recommended alvisnlp options

-verbose and -log XXXX

We recommend a different log file for each chunk.

-cleanTmp

Erases the temporary directory when the processing is finished.
If this option is not specified then the Grid Engine administrator will snap at you.

-entity chunk XXXX

Specifies the value of an XML entity used in the plan (here, &chunk;).
Several entities can be specified.

Submitting everything at once

Submitting the jobs for all chunks will look like this:

qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk1 alvisnlp -cleanTmp -verbose -log chunk1/alvisnlp.log -entity chunk chunk1 plan.xml
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk2 alvisnlp -cleanTmp -verbose -log chunk2/alvisnlp.log -entity chunk chunk2 plan.xml
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk3 alvisnlp -cleanTmp -verbose -log chunk3/alvisnlp.log -entity chunk chunk3 plan.xml
...

This sequence can obviously be scripted (a naive loop is sketched after the list below); however, a few problems must be dealt with:

  • qsub returns immediately, so you have to monitor the jobs yourself in order to know when they have all finished
  • the only way to be aware of a job's failure is to look at alvisnlp's log file
  • some jobs fail non-deterministically (OutOfMemory), so you may have to re-submit failed jobs
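
A naive submission loop (a sketch, assuming one directory per chunk named chunk1, chunk2, ... as above) solves none of these problems:

# submit one job per chunk directory; monitoring and re-submission remain manual
for chunk in chunk*/; do
    chunk=${chunk%/}
    qsub -j yes -o /dev/null -q short.q -cwd -V -N "$chunk" \
        alvisnlp -cleanTmp -verbose -log "$chunk/alvisnlp.log" \
        -entity chunk "$chunk" plan.xml
done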

Enter /projet/maiage/work/textemig/software/misc-utils/qsync.py, a script that submits several jobs with the following features:

  • returns when all jobs have finished
  • automatically polls the scheduler at regular intervals, and provides verbose status
  • resubmits failed jobs (there's a re-submission limit)

qsync.py uses the drmaa library to communicate with the Grid Engine scheduler. This library reads the DRMAA_LIBRARY_PATH environment variable to locate the native library. On migale:

export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx24-amd64/libdrmaa.so
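
A quick sanity check, as a sketch (it assumes the python drmaa binding used by qsync.py is installed):

# set the library path and verify that the python drmaa binding can load it
export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx24-amd64/libdrmaa.so
python -c "import drmaa"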