Cheating

Observe stfilter.plan and common.plan

1.
The main advantage of sequence is that it organizes the plan into sub-tasks. A plan broken into sensible sequences may be easier to understand.

import allows plans to be re-used inside more complete plans. A good library of importable plans saves time by avoiding the need to rewrite boilerplate plans.

2.
The documentation of each module provides some information on each parameter.

Module                       Parameter             File                                                                             Flow
common.read.positive         sourcePath            resources/Bsub_pos.txt                                                           in
common.read.negative         sourcePath            resources/Bsub_neg.txt                                                           in
common.named-entities.taxa   dictFile              /bibdev/ressources/entites_nommees/ncbi_mig_taxon_names/automatic/taxa_full.txt  in
common.named-entities.genes  dictFile              resources/genes.txt                                                              in
common.pos                   treeTaggerExecutable  /bibdev/install/tree-tagger-3.2/bin/tree-tagger                                  exe
common.pos                   parFile               /bibdev/install/tree-tagger-3.2/lib/english.par                                  in
common.extract.taxa          outFile               output/taxa.txt                                                                  out
common.extract.words         outFile               output/words.txt                                                                 out
train                        relationDefinition    attr/base.xml                                                                    in
train                        classifierFile        output/stfilter.classifier                                                       out
train                        evaluationFile        output/stfilter.eval                                                             out
train                        arffFile              output/stfilter.arff                                                             out
train                        extractResults        output/prediction.txt                                                            out

3.
The entryFeatureNames parameter specifies the names of the features in which the 2nd, 3rd and 4th columns of the file /bibdev/ressources/entites_nommees/ncbi_mig_taxon_names/automatic/taxa_full.txt are stored. The first column is the key that is searched for. The other three columns hold, respectively, the NCBI taxon identifier, the POS tag and the taxonomic rank of the taxon.
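
As an illustration only, the mapping from columns to features can be sketched in Python. This is not AlvisNLP/ML code: the tab-separated layout, the sample line and the feature names taxid, pos and rank are assumptions made for the sketch; the actual names are the ones listed in entryFeatureNames in common.plan.

```python
def parse_dict_line(line, entry_feature_names):
    """Split one dictionary line into (key, features)."""
    columns = line.rstrip("\n").split("\t")
    # The first column is the searched key, the rest are feature values.
    key, values = columns[0], columns[1:]
    # Pair each remaining column with the feature name at the same position.
    features = dict(zip(entry_feature_names, values))
    return key, features


# Hypothetical dictionary line and hypothetical feature names.
sample = "Bacillus subtilis\t1423\tNP\tspecies"
key, features = parse_dict_line(sample, ["taxid", "pos", "rank"])
print(key, features)
```

Each annotation created for a matched key would then carry these three features.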

4.
Use ExpressionExtract, the same module type used to export taxon names and words.
In common.plan, inside the extract sequence, add the following module:

<module id="genes" class="ExpressionExtract">
  <target value="documents.sections.layer:genes"/>
  <fields value="form"/>
  <outFile value="output/genes.txt"/>
</module>

5.
The module fixed-forms prepares a layer (fixed-forms) that the word segmentation later uses to identify annotations that must not be split. These are usually used to ensure the atomicity of rigid designators.
The module fixed-forms-overlaps ensures that the fixed-forms layer contains no overlapping annotations by the time words are segmented.
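
To make the idea of overlap removal concrete, here is a minimal Python sketch. The policy it applies (the leftmost span wins, and the longest span wins among spans starting at the same offset) is an assumption chosen for illustration; the policy actually configured in fixed-forms-overlaps should be checked in common.plan. Spans are hypothetical (start, end) character offsets with an exclusive end.

```python
def remove_overlaps(annotations):
    """Keep a non-overlapping subset of (start, end) spans.

    Assumed policy: leftmost span first, longest span first on ties.
    """
    kept = []
    last_end = -1
    ordered = sorted(annotations, key=lambda a: (a[0], -(a[1] - a[0])))
    for start, end in ordered:
        # A span is kept only if it starts at or after the end of the last kept span.
        if start >= last_end:
            kept.append((start, end))
            last_end = end
    return kept


# Hypothetical spans: the second and fourth overlap the first and third.
spans = [(0, 17), (9, 17), (20, 24), (22, 30)]
print(remove_overlaps(spans))
```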

6.
The classified elements are specified by the examples parameter of the module train in stfilter.plan.
The classified elements are documents.

7.
Read output/stfilter.eval.
Check Weka's documentation, especially the subclasses of the Classifier class.

8.
Learning attributes are specified by the relationDefinition parameter of the module train in stfilter.plan.

Observe attr/base.xml

1.
One that is called length.
Weka requires the predicted class to be encoded as an attribute. In base.xml this attribute is named class, and its XML attribute class is set to yes.
More details about how to define learning attributes:

2.
length is a numeric attribute.
Its value is the number of words in all sections of the example document.

3.
bag generates a set of boolean attributes that behave like a bag of words:
  • prefix specifies the prefix of all generated attributes
  • loadValues specifies the file containing the list of words to test; there are as many generated attributes as lines in the specified file
  • feature specifies what feature of the bag must be tested against the values in loadValues

The contents of bag specify the bag of words to consider for each example. In the case of attr/bow.xml, it is the set of words in all sections of the example document. In the case of attr/vici.xml, it is the set of words within a window of length 2 around gene names.
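
The behaviour described above can be sketched in Python. This is a conceptual illustration, not the AlvisNLP/ML implementation: the token dictionaries, the prefix word_, the vocabulary list and the choice to exclude the gene names themselves from the window are all assumptions made for the example.

```python
def bag_attributes(prefix, vocabulary, tokens, feature):
    """Return one boolean attribute per vocabulary entry (one per loadValues line)."""
    present = {token[feature] for token in tokens}
    return {prefix + value: (value in present) for value in vocabulary}


def window_around(tokens, is_target, size=2):
    """Return the tokens within `size` positions of a target token, targets excluded."""
    selected = []
    for i, token in enumerate(tokens):
        if is_target(token):
            continue
        lo, hi = max(0, i - size), min(len(tokens), i + size + 1)
        if any(is_target(t) for t in tokens[lo:hi]):
            selected.append(token)
    return selected


# Hypothetical tokenized document: (form, lemma, is a gene name).
raw = [
    ("The", "the", False), ("expression", "expression", False),
    ("of", "of", False), ("sigK", "sigK", True), ("is", "be", False),
    ("tightly", "tightly", False), ("regulated", "regulate", False),
    ("during", "during", False), ("late", "late", False),
    ("sporulation", "sporulation", False),
]
tokens = [{"form": f, "lemma": l, "gene": g} for f, l, g in raw]
vocabulary = ["expression", "regulate", "sporulation"]

# Whole-document bag, in the spirit of attr/bow.xml.
whole = bag_attributes("word_", vocabulary, tokens, "lemma")
# Bag restricted to a window of 2 around gene names, in the spirit of attr/vici.xml.
near = bag_attributes(
    "word_", vocabulary, window_around(tokens, lambda t: t["gene"]), "lemma"
)
print(whole)
print(near)
```

Passing "form" instead of "lemma" as the last argument mirrors the role of the feature parameter: it selects which token feature is compared against the values in the vocabulary file.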

4.
Set the relationDefinition parameter to attr/bow.xml, then run AlvisNLP/ML and look for the results in output/stfilter.eval.
Do the same for attr/vici.xml and compare the results.

5.
Change the following parameters:
  • fields in common.extract.words to lemma
  • feature in attr/bow.xml to lemma

6.
Do your homework.

Using dependencies

1. and 2.
attachment:common_ccg.plan

3.
The following command runs the plan normally and additionally dumps the corpus state to the output/parsed.dmp file.
alvisnlp -dumpModule common.parse output/parsed.dmp stfilter.plan

This corpus state can be used in subsequent invocations of AlvisNLP/ML:
alvisnlp -resume output/parsed.dmp stfilter.plan
Note that all modules up to common.parse are skipped. This shortens the experiment cycle, provided that the plan does not change before common.parse.

Command-line options can be listed by typing: alvisnlp -help

4.
The file attachment:vicidep.xml uses the bag of words linked to a gene name by a dependency.

Expressions can be tricky; see Element Expression and Element Expression Examples.