Using AlvisNLP/ML on Migale

AlvisNLP/ML installed on bibdev

There are two versions installed on bibdev:

/bibdev/install/alvisnlp/latest/bin/alvisnlp

The latest numbered version of AlvisNLP/ML. It passes regression tests and all modules are fully documented. This version is quite old and may not include recent features.
To set the PATH in order to use this version, run /bibdev/install/alvisnlp/latest/environ.sh.

Note: the current latest version is really old. Don't use it.

/bibdev/install/alvisnlp/devel/bin/alvisnlp

The latest revision of AlvisNLP/ML that passes regression tests. Not all modules are necessarily documented.
To set the PATH in order to use this version, run /bibdev/install/alvisnlp/devel/environ.sh.
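
For instance, a minimal sketch (assuming environ.sh is intended to be sourced so that it can modify the PATH of the current shell):

# source the environment script, then check which alvisnlp binary is now found
source /bibdev/install/alvisnlp/devel/environ.sh
which alvisnlp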

On migale

Connect as user textemig. AlvisNLP/ML is installed in:

/projet/maiage/work/textemig/software/install/alvisnlp

A symlink in textemig's home provides a different path to the same directory:

/projet/maiage/save/textemig/projet-work/software/install/alvisnlp

Submitting AlvisNLP/ML jobs on the Grid Engine

The scenario assumed here is the processing of a large corpus with a single AlvisNLP/ML plan.

Find a shared disk space

All resources must be reachable from migale and the cluster nodes (named nXX). This includes:

  • the AlvisNLP/ML install directory
  • all external executables used with the plan (treetagger, yatea, etc.), usually in /projet/maiage/work/textemig/software/
  • the corpus files
  • all resources used by the modules (lexicons, classifiers, etc.), usually in /projet/maiage/work/textemig/resources/
  • directories where output will be written

The following directories are mounted on migale and all nodes:

  • /projet/mig/work: it is recommended to place as much data as possible in this directory (a quick reachability check is sketched below)
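
To check that a given path is actually visible from a compute node, a quick sketch (assuming interactive submission through qrsh is allowed on the cluster):

# run ls on a node through the scheduler; an error means the path is not mounted there
qrsh -q short.q ls -d /projet/maiage/work/textemig/resources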

Split the corpus

Split the corpus into chunks. There is no general rule for chunk size: you will usually have to balance many small chunks against few large chunks.

Advantages of smaller chunks:

  • more parallelism
  • easier re-run if the processing of some chunks failed
  • in some cases, smaller chunks are less likely to fail (fewer out-of-memory errors, for instance)

Advantages of larger chunks:

  • parallelism is limited by the number of slots in the Grid Engine anyway
  • less time spent in warm-up (resource loading)

Tools for splitting

  • split is a standard Unix command that splits text files into chunks with an equal number of lines (see the sketch below)
  • /projet/maiage/work/textemig/software/pubmed-utils/split-pubmed.py is a script that splits PubMed XML files
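
For instance, a minimal sketch with split, assuming a line-oriented corpus file corpus.txt (the file name, chunk size and directory layout are only examples):

# split the corpus into chunks of 10000 lines each, named chunk_aa, chunk_ab, ...
split -l 10000 corpus.txt chunk_
# create one directory per chunk and move each piece inside as input.txt,
# matching the single-entity layout recommended in the next section
i=1
for f in chunk_*; do
    mkdir -p "chunk$i"
    mv "$f" "chunk$i/input.txt"
    i=$((i+1))
done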

Prepare a cluster-ready plan

The same plan will be used for all chunks. However, some parameters will be specific to each chunk:

  • parameters that specify the source in the reader module (obviously)
  • parameters that specify the output directory or file for modules that write output (TabularExport, AlvisIRIndexer, etc.)

In the variable parts of the parameters, one can use a custom XML entity:

<module id="reader" class="TexFileReader">
  <sourcePath>&inputfile;</sourcePath>
</module>

...

<module id="export" class="TabularExport">
  <outDir>output</outDir>
  <fileName>"&outputfile;"</fileName>
</module>

The alvisnlp command line allows you to specify the value of XML entities.

A different entity could be used for each variable part; however, we recommend creating a directory for each chunk and placing the input and output files in the corresponding chunk directory. In this way a single custom entity is required:

<module id="reader" class="TexFileReader">
  <sourcePath>&chunk;/input.txt</sourcePath>
</module>

...

<module id="export" class="TabularExport">
  <outDir>&chunk;</outDir>
  <fileName>"output.txt"</fileName>
</module>

Submission command-line

qsub [qsub options] alvisnlp [alvisnlp options]

Recommended qsub options

-j yes and -o /dev/null

-j yes merges the standard output and standard error, -o /dev/null discards them both.
Otherwise the scheduler creates one file per stream for each submission; these files are cumbersome and useless with alvisnlp.

-q short.q

Submit to the short-jobs queue. Jobs in this queue are killed after 4 hours of run time. In the unlikely case your jobs need more time, submit to long.q.
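
To check the actual run-time limit of a queue, a quick sketch (assuming the standard Grid Engine client tools are available on migale):

# show the queue configuration and look for the soft and hard run-time limits (s_rt, h_rt)
qconf -sq short.q | grep _rt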

-cwd

Run the job using the same working directory as when the job was submitted.
If this option is not specified then the job's current directory is your home directory, and all your relative paths will be broken.

-V

Exports your environment variables to the job's environment.
If this option is not specified then the job's environment contains only $HOME (not even $PATH).

-N XXXX

Sets the name of the job.
If this option is not specified then the job name is "alvisnlp", which is not very practical for monitoring lots of jobs.
We recommend a name based on the chunk number.

Recommended alvisnlp options

-verbose and -log XXXX

We recommend a different log file for each chunk.

-cleanTmp

Erases the temporary directory when the processing is finished.
If this option is not specified then the Grid Engine administrator will snap at you.

-entity chunk XXXX

Specifies the value of an XML entity used in the plan (here, &chunk;).
Several entities can be specified.

Submitting everything at once

Submitting the jobs for all chunks will look like this:

qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk1 alvisnlp -cleanTmp -verbose -log chunk1/alvisnlp.log -entity chunk chunk1 plan.xml
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk2 alvisnlp -cleanTmp -verbose -log chunk2/alvisnlp.log -entity chunk chunk2 plan.xml
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk3 alvisnlp -cleanTmp -verbose -log chunk3/alvisnlp.log -entity chunk chunk3 plan.xml
...

This sequence can obviously be scripted (a naive loop is sketched after the list below); however, a few problems must be dealt with:

  • qsub returns immediately, so you have to monitor the jobs yourself in order to know when they have all finished
  • the only way to be aware of a job's failure is to look at alvisnlp's log file
  • some jobs fail non-deterministically (OutOfMemory), so you may have to re-submit failed jobs
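
A naive submission loop (a sketch, assuming one directory per chunk named chunk1, chunk2, ... as above) solves none of these problems:

# submit one job per chunk directory; monitoring and re-submission remain manual
for chunk in chunk*/; do
    chunk=${chunk%/}
    qsub -j yes -o /dev/null -q short.q -cwd -V -N "$chunk" \
        alvisnlp -cleanTmp -verbose -log "$chunk/alvisnlp.log" \
        -entity chunk "$chunk" plan.xml
done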

Enter /projet/maiage/work/textemig/software/misc-utils/qsync.py, a script that submits several jobs with the following features:

  • returns when all jobs have finished
  • automatically polls the scheduler at regular intervals, and provides verbose status
  • resubmits failed jobs (there's a re-submission limit)

qsync.py uses the drmaa library to communicate with the Grid Engine scheduler. This library reads the DRMAA_LIBRARY_PATH environment variable to locate the native library. On migale:

export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx24-amd64/libdrmaa.so
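
A quick sanity check, as a sketch (it assumes the python drmaa binding used by qsync.py is installed):

# set the library path and verify that the python drmaa binding can load it
export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx24-amd64/libdrmaa.so
python -c "import drmaa"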