Using AlvisNLP/ML on Migale¶
AlvisNLP/ML installed on bibdev¶
There are two versions installed on bibdev:
/bibdev/install/alvisnlp/latest/bin/alvisnlp
The latest numbered version of AlvisNLP/ML. It passes regression tests and all its modules are fully documented. However, this version is quite old and may lack recent features.
To set the PATH in order to use this version, run /bibdev/install/alvisnlp/latest/environ.sh.
Note: the current latest version is really old. Don't use it.
/bibdev/install/alvisnlp/devel/bin/alvisnlp
The latest revision of AlvisNLP/ML that passes regression tests. Not all modules are necessarily documented.
To set the PATH in order to use this version, run /bibdev/install/alvisnlp/devel/environ.sh.
On migale¶
Connect as user textemig:
/projet/maiage/work/textemig/software/install/alvisnlp
There is a symlink in textemig's home, a different path to the same directory:
/projet/maiage/save/textemig/projet-work/software/install/alvisnlp
Submitting AlvisNLP/ML jobs on the Grid Engine¶
The scenario assumed here is the processing of a large corpus with a single AlvisNLP/ML plan.
Find a shared disk space¶
All resources must be reachable from migale and the cluster nodes (named nXX). This includes:
- the AlvisNLP/ML install directory
- all external executables used with the plan (treetagger, yatea, etc.), usually in /projet/maiage/work/textemig/software/
- the corpus files
- all resources used by the modules (lexicons, classifiers, etc.), usually in /projet/maiage/work/textemig/resources/
- directories where output will be written
The following directories are mounted on migale and all nodes:
/projet/mig/work: it is recommended to place as much data as possible in this directory
Split the corpus¶
Split the corpus into chunks. There is no general rule for the size of corpus chunks; usually you will have to balance between many small chunks and few large chunks.
Advantages of smaller chunks:
- more parallelism
- easier re-run if the processing of some chunks failed
- in some cases, smaller chunks are less likely to fail (fewer out-of-memory errors, for instance)
Advantages of larger chunks:
- parallelism is limited to the number of slots in the Grid Engine anyway
- spend less time in warm-up (resource loading)
Tools for splitting¶
- split is a GNU/Unix command that splits text files into chunks with an equal number of lines
- /projet/maiage/work/textemig/software/pubmed-utils/split-pubmed.py is a script that splits PubMed XML files
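For example, split can be used like this (a sketch: corpus.txt and the 4-line chunk size are stand-ins for your real corpus and chunk size):

```shell
# Toy stand-in for the real corpus: one document per line.
seq 1 10 > corpus.txt
# Split into chunks of 4 lines each: chunk-aa, chunk-ab, chunk-ac.
split -l 4 corpus.txt chunk-
# Check the chunk sizes.
wc -l chunk-*
```

The last chunk simply holds whatever lines remain (here, 2).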
Prepare a cluster-ready plan¶
The same plan will be used for all chunks. However, some parameters are specific to each chunk:
- parameters that specify the source in the reader module (obviously)
- parameters that specify the output directory or file for modules that write stuff (TabularExport, AlvisIRIndexer, etc.)
In the variable parts of the parameters, one can use a custom XML entity:
<module id="reader" class="TextFileReader">
  <sourcePath>&inputfile;</sourcePath>
</module>
...
<module id="export" class="TabularExport">
  <outDir>output</outDir>
  <fileName>"&outputfile;"</fileName>
</module>
The alvisnlp command line allows you to specify the value of XML entities.
A different entity can be used for each variable part; however, we recommend creating a directory for each chunk and placing the input and output files in the corresponding chunk directory. In this way a single custom entity is required:
<module id="reader" class="TextFileReader">
  <sourcePath>&chunk;/input.txt</sourcePath>
</module>
...
<module id="export" class="TabularExport">
  <outDir>&chunk;</outDir>
  <fileName>"output.txt"</fileName>
</module>
Submission command-line¶
qsub [qsub options] alvisnlp [alvisnlp options]
Recommended qsub options¶
-j yes and -o /dev/null
-j yes merges the standard output and standard error; -o /dev/null sends them both to nowhere.
Otherwise the scheduler creates a file for both streams for each submission; these files are cumbersome and useless with alvisnlp.
-q short.q
Submit to the short jobs queue. Jobs are killed after running for 4 hours. In the unlikely case your jobs need more time, submit to long.q.
-cwd
Run the job using the same working directory as when the job was submitted.
If this option is not specified then the job's current directory is your home directory, and all your relative paths will be broken.
-V
Exports the environment variables to the job's environment.
If this option is not specified then the job's environment has only $HOME (not even $PATH).
-N XXXX
Sets the name of the job.
If this option is not specified then the job name is "alvisnlp", which is not very practical for monitoring lots of jobs.
We recommend a name based on the chunk number.
Recommended alvisnlp options¶
-verbose and -log XXXX
We recommend a different log file for each chunk.
-cleanTmp
Erases the temporary directory when the processing is finished.
If this option is not specified then the Grid Engine administrator will snap at you.
-entity chunk XXXX
Specifies the value of an entity.
Several entities can be specified.
Submitting everything at once¶
Submitting the jobs for all chunks will look like this:
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk1 alvisnlp -cleanTmp -verbose -log chunk1/alvisnlp.log -entity chunk chunk1 plan.xml
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk2 alvisnlp -cleanTmp -verbose -log chunk2/alvisnlp.log -entity chunk chunk2 plan.xml
qsub -j yes -o /dev/null -q short.q -cwd -V -N chunk3 alvisnlp -cleanTmp -verbose -log chunk3/alvisnlp.log -entity chunk chunk3 plan.xml
...
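A loop over the chunk directories can generate these commands. This is only a sketch: it assumes chunk directories named chunk1, chunk2, ... next to plan.xml (the mkdir line is a stand-in for real data), and it writes the qsub commands to submit.sh so they can be reviewed before being run with sh submit.sh:

```shell
# Stand-ins for the real chunk directories.
mkdir -p chunk1 chunk2 chunk3
# Generate one qsub command per chunk directory into submit.sh.
for chunk in chunk*/; do
    chunk=${chunk%/}
    echo "qsub -j yes -o /dev/null -q short.q -cwd -V -N $chunk" \
         "alvisnlp -cleanTmp -verbose -log $chunk/alvisnlp.log" \
         "-entity chunk $chunk plan.xml"
done > submit.sh
```

Generating the script instead of calling qsub directly in the loop lets you check the commands (queue, entity values, log paths) before submitting.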
This sequence can obviously be scripted; however, several problems must be dealt with:
- qsub returns immediately, so you have to monitor the jobs yourself in order to know when all jobs have finished
- the only way to be aware of a job's failure is to look at alvisnlp's log file
- some jobs fail non-deterministically (OutOfMemory), so you may have to re-submit failed jobs
Enter /projet/maiage/work/textemig/software/misc-utils/qsync.py, a script that submits several jobs with the following features:
- returns when all jobs have finished
- automatically polls the scheduler at regular intervals, and provides verbose status
- resubmits failed jobs (there's a re-submission limit)
qsync.py uses the drmaa library to communicate with the Grid Engine scheduler. This library reads the environment variable DRMAA_LIBRARY_PATH in order to locate the native library. On migale:
export DRMAA_LIBRARY_PATH=/opt/sge/lib/lx24-amd64/libdrmaa.so