Cheating
Observe stfilter.plan and common.plan
1.
The main advantage of sequence is that it organizes the plan into sub-tasks. A plan broken into sensible sequences may be easier to understand.
import allows plans to be reused inside more complete plans. A good library of importable plans saves time by avoiding the need to rewrite boilerplate plans.
2.
The documentation of each module provides some information on each parameter.
Module | Parameter | File | Flow |
common.read.positive | sourcePath | resources/Bsub_pos.txt | in |
common.read.negative | sourcePath | resources/Bsub_neg.txt | in |
common.named-entities.taxa | dictFile | /bibdev/ressources/entites_nommees/ncbi_mig_taxon_names/automatic/taxa_full.txt | in |
common.named-entities.genes | dictFile | resources/genes.txt | in |
common.pos | treeTaggerExecutable | /bibdev/install/tree-tagger-3.2/bin/tree-tagger | exe |
common.pos | parFile | /bibdev/install/tree-tagger-3.2/lib/english.par | in |
common.extract.taxa | outFile | output/taxa.txt | out |
common.extract.words | outFile | output/words.txt | out |
train | relationDefinition | attr/base.xml | in |
train | classifierFile | output/stfilter.classifier | out |
train | evaluationFile | output/stfilter.eval | out |
train | arffFile | output/stfilter.arff | out |
train | extractResults | output/prediction.txt | out |
3.
The entryFeatureNames parameter specifies the names of the features in which the 2nd, 3rd and 4th columns of the file /bibdev/ressources/entites_nommees/ncbi_mig_taxon_names/automatic/taxa_full.txt are stored. The first column is the search key. The other three columns hold the NCBI taxon identifier, the POS tag and the taxonomic rank of the taxon.
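To illustrate the idea of a key column followed by named feature columns, here is a minimal Python sketch. It is not AlvisNLP/ML code: the tab-separated layout, the feature names taxid, pos and rank, and the sample lines are assumptions made for the example, not values read from taxa_full.txt or from the plan.

```python
import csv
import io

# Hypothetical feature names; the real ones are whatever entryFeatureNames lists.
ENTRY_FEATURE_NAMES = ["taxid", "pos", "rank"]


def load_dictionary(lines, feature_names):
    """Map each key (column 1) to a dict of named features (columns 2..n)."""
    entries = {}
    for row in csv.reader(lines, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        key, values = row[0], row[1:]
        entries[key] = dict(zip(feature_names, values))
    return entries


# Illustrative sample in the assumed layout (not copied from taxa_full.txt).
SAMPLE = (
    "Bacillus subtilis\t1423\tNP\tspecies\n"
    "Escherichia coli\t562\tNP\tspecies\n"
)

dictionary = load_dictionary(io.StringIO(SAMPLE), ENTRY_FEATURE_NAMES)
print(dictionary["Bacillus subtilis"])
```

Each matched taxon name would then carry the three values under the names given in entryFeatureNames, in column order.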
4.
Use ExpressionExtract, the same module type used to export taxon names and words.
In common.plan, inside the extract sequence, add the following module:
<module id="genes" class="ExpressionExtract">
<target value="documents.sections.layer:genes"/>
<fields value="form"/>
<outFile value="output/genes.txt"/>
</module>
5.
The module fixed-forms prepares a layer (fixed-forms) that will be used later by the word segmentation to specify annotations that should not be broken. Usually these are used to ensure the atomicity of rigid designators.
The module fixed-forms-overlaps ensures that the fixed-forms layer does not contain overlapping annotations when words are segmented.
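Why overlaps must be removed before segmentation can be seen on a small Python sketch. It is only an illustration of the concept, not the module's implementation: the selection strategy (keep the leftmost span, and the longest one among spans starting at the same offset) and the sample offsets are assumptions.

```python
def remove_overlaps(spans):
    """Return a non-overlapping subset of (start, end) spans, end exclusive.

    Assumed strategy: scan spans from left to right, longest first when
    several start at the same offset, and drop any span that starts
    before the previously kept span ends.
    """
    kept = []
    for start, end in sorted(spans, key=lambda s: (s[0], -s[1])):
        if not kept or start >= kept[-1][1]:
            kept.append((start, end))
    return kept


# Hypothetical character offsets of fixed-form annotations in one section.
spans = [(0, 17), (9, 17), (20, 24), (22, 30)]
clean = remove_overlaps(spans)
print(clean)
```

With a layer cleaned this way, every character belongs to at most one fixed form, so the word segmentation never has to choose between two conflicting boundaries.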
6.
The classified elements are specified by the examples parameter of the module train in stfilter.plan.
The classified elements are documents.
7.
Read output/stfilter.eval.
Check Weka's documentation, especially the subclasses of the Classifier class.
8.
Learning attributes are specified by the relationDefinition parameter of the module train in stfilter.plan.
Observe attr/base.xml
1.
One that is called length.
Weka requires that the predicted class is encoded as an attribute. In base.xml the class attribute is named class and has its class XML attribute set to yes.
More details about how to define learning attributes:
2.
length is a numeric attribute. Its value is the number of words in all sections of the example document.
3.
bag generates a set of boolean attributes that behave like a bag of words:
- prefix specifies the prefix of all generated attributes
- loadValues specifies the file containing the list of words to test; there are as many generated attributes as there are lines in the specified file
- feature specifies which feature of the bag must be tested against the values in loadValues
The contents of bag specify the bag of words to consider for each example. In the case of attr/bow.xml it is the set of words in all sections of the example document. In the case of attr/vici.xml it is the set of words within a window of length 2 around gene names.
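The difference between the two bags can be sketched in plain Python. This is an illustration of the attribute semantics described above, not what AlvisNLP/ML or Weka actually execute. The word_ prefix, the sample sentence, the value list, and the reading of "window of length 2" as two words on each side of a gene name (gene names themselves excluded) are all assumptions.

```python
def bag_attributes(prefix, values, bag):
    """One boolean attribute per value: True if the value occurs in the bag."""
    present = set(bag)
    return {prefix + value: value in present for value in values}


def window_bag(words, gene_positions, size=2):
    """Words at most `size` positions away from a gene name.

    Assumption: gene names themselves are left out of the bag.
    """
    genes = set(gene_positions)
    bag = set()
    for g in gene_positions:
        for i in range(max(0, g - size), min(len(words), g + size + 1)):
            if i not in genes:
                bag.add(words[i])
    return bag


# Hypothetical example document, already segmented into words.
words = ["the", "sigK", "gene", "is", "expressed", "in", "the", "mother", "cell"]
gene_positions = [1]                      # index of the gene name "sigK"
values = ["gene", "expressed", "cell"]    # stands in for the lines of the loadValues file

length = len(words)                                       # numeric attribute
whole_document = bag_attributes("word_", values, words)   # bag as in bow.xml
near_genes = bag_attributes("word_", values, window_bag(words, gene_positions))  # bag as in vici.xml

print(length, whole_document, near_genes)
```

Both variants produce the same attribute names, one per line of the value list; only the bag that each value is tested against changes, which is why the two definitions can be swapped without touching the rest of the plan.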
4.
Set the relationDefinition parameter to attr/bow.xml, then run AlvisNLP/ML and look for the results in output/stfilter.eval.
Do the same for attr/vici.xml and compare the results.
5.
Change the following parameters:
- fields in common.extract.words to lemma
- feature in attr/bow.xml to lemma
6.
Do your homework.
Using dependencies
1. and 2.
attachment:common_ccg.plan
3.
The following command runs the plan normally, but additionally dumps the corpus state into the output/parsed.dmp file.
alvisnlp -dumpModule common.parse output/parsed.dmp stfilter.plan
This corpus state can be used in subsequent invocations of AlvisNLP/ML:
alvisnlp -resume output/parsed.dmp stfilter.plan
Notice that all modules up to common.parse are skipped. This is useful to shorten the experiment cycle, as long as the plan does not change before common.parse.
Command-line options can be listed by typing: alvisnlp -help
4.
The file attachment:vicidep.xml uses the bag of words linked to a gene name by a dependency.
Expressions can be tricky; see Element Expression and Element Expression Examples.