
Cheating

Observe `stfilter.plan` and `common.plan`

1.
The main advantage of sequence is to organize the plan into sub-tasks. A plan broken into sensible sequences may be easier to understand.

import allows plans to be reused inside more complete plans. A good library of importable plans saves time by avoiding the need to rewrite boilerplate plans.

2.
The documentation of each module provides some information on each parameter.

Module	Parameter	File	Flow
common.read.positive	sourcePath	resources/Bsub_pos.txt	in
common.read.negative	sourcePath	resources/Bsub_neg.txt	in
common.named-entities.taxa	dictFile	/bibdev/ressources/entites_nommees/ncbi_mig_taxon_names/automatic/taxa_full.txt	in
common.named-entities.genes	dictFile	resources/genes.txt	in
common.pos	treeTaggerExecutable	/bibdev/install/tree-tagger-3.2/bin/tree-tagger	exe
common.pos	parFile	/bibdev/install/tree-tagger-3.2/lib/english.par	in
common.extract.taxa	outFile	output/taxa.txt	out
common.extract.words	outFile	output/words.txt	out
train	relationDefinition	attr/base.xml	in
train	classifierFile	output/stfilter.classifier	out
train	evaluationFile	output/stfilter.eval	out
train	arffFile	output/stfilter.arff	out
train	extractResults	output/prediction.txt	out

3.
The entryFeatureNames parameter specifies the names of the features in which the 2nd, 3rd and 4th columns of the file /bibdev/ressources/entites_nommees/ncbi_mig_taxon_names/automatic/taxa_full.txt are stored. The first column is the searched key. The other three columns represent the NCBI taxon identifier, the POS tag and the taxonomic rank of the taxon.
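To make the column-to-feature mapping concrete, here is a minimal Python sketch. It is not AlvisNLP/ML code: it assumes the dictionary file is tab-separated, and the feature names (taxid, pos, rank) and the sample line are hypothetical, chosen only for illustration.

```python
def parse_dictionary_line(line, entry_feature_names):
    """Split one dictionary line into (key, features).

    Assumption: columns are tab-separated; the first column is the
    searched key and the remaining columns are stored, in order, under
    the names given in entry_feature_names.
    """
    columns = line.rstrip("\n").split("\t")
    key, values = columns[0], columns[1:]
    if len(values) != len(entry_feature_names):
        raise ValueError(
            "expected %d value columns, got %d"
            % (len(entry_feature_names), len(values))
        )
    return key, dict(zip(entry_feature_names, values))


# Hypothetical feature names and a made-up dictionary line.
names = ["taxid", "pos", "rank"]
key, features = parse_dictionary_line("Bacillus subtilis\t1423\tNP\tspecies\n", names)
print(key, features)
```

Each annotation matching the key would then carry one feature per remaining column.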

4.
Use ExpressionExtract, the same module type used to export taxon names and words.
In common.plan, inside the extract sequence, add the following module:

<module id="genes" class="ExpressionExtract">
  <target value="documents.sections.layer:genes"/>
  <fields value="form"/>
  <outFile value="output/genes.txt"/>
</module>

5.
The module fixed-forms prepares a layer (fixed-forms) that word segmentation uses later to identify annotations that must not be split. These are usually used to ensure the atomicity of rigid designators.
The module fixed-forms-overlaps ensures that the fixed-forms layer does not contain overlapping annotations when words are segmented.
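As an illustration of what removing overlaps from a layer can look like, here is a small Python sketch. The strategy shown (keep the leftmost annotation, preferring the longest when two start at the same offset) is an assumption made for the example; the strategy actually used by fixed-forms-overlaps is set by its parameters and may differ.

```python
def remove_overlaps(annotations):
    """Return a non-overlapping subset of (start, end) annotations.

    Offsets are half-open: an annotation covers start..end-1.
    Assumed strategy: scan left to right, longest first on ties,
    and drop anything that overlaps an annotation already kept.
    """
    kept = []
    for start, end in sorted(annotations, key=lambda a: (a[0], -a[1])):
        if not kept or start >= kept[-1][1]:
            kept.append((start, end))
    return kept


layer = [(0, 5), (3, 8), (8, 12), (0, 3)]
print(remove_overlaps(layer))
```

After this step every character belongs to at most one annotation, so a word segmenter can treat each remaining annotation as a single unbreakable unit.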

6.
The classified elements are specified by the examples parameter of the module train in stfilter.plan.
The classified elements are documents.

7.
Read output/stfilter.eval.
Check Weka's documentation, especially the subclasses of the Classifier class.

8.
Learning attributes are specified by the relationDefinition parameter of the module train in stfilter.plan.

Observe `attr/base.xml`

1.
One that is called length.
Weka requires that the predicted class be encoded as an attribute. In base.xml this attribute is named class, and its class XML attribute is set to yes.
More details about how to define learning attributes:

2.
length is a numeric attribute.
Its value is the number of words in all sections of the example document.

3.
bag generates a set of boolean attributes that behave like a bag of words:

prefix specifies the prefix of all generated attributes
loadValues specifies the file containing the list of words to test; there are as many generated attributes as there are lines in the specified file
feature specifies what feature of the bag must be tested against the values in loadValues

The contents of bag specifies the bag of words to consider for each example. In the case of attr/bow.xml it is the set of words in all sections of the example document. In the case of attr/vici.xml it is the set of words within a window of length 2 around gene names.
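The behaviour described above can be sketched in Python as follows. This is not the AlvisNLP/ML implementation: the function names, the prefix w_ and the sample data are hypothetical, and "a window of length 2" is assumed here to mean two words on each side of the gene name, excluding the gene name itself.

```python
def bag_attributes(prefix, values, bag):
    """One boolean attribute per value, true if the value is in the bag.

    values plays the role of the lines of the loadValues file, and bag
    is the collection of feature values (forms, lemmas...) of the example.
    """
    present = set(bag)
    return {prefix + value: (value in present) for value in values}


def window_bag(words, gene_positions, size=2):
    """Words within `size` positions of each gene name (assumed meaning)."""
    bag = set()
    for pos in gene_positions:
        lo = max(0, pos - size)
        hi = min(len(words), pos + size + 1)
        bag.update(words[i] for i in range(lo, hi) if i != pos)
    return bag


words = ["the", "sigK", "gene", "is", "expressed", "late"]
values = ["gene", "expressed", "sporulation"]

# Whole-document bag, in the spirit of attr/bow.xml.
print(bag_attributes("w_", values, words))
# Bag restricted to a window around the gene name at index 1,
# in the spirit of attr/vici.xml.
print(bag_attributes("w_", values, window_bag(words, [1])))
```

The two calls differ only in the bag they receive, which mirrors how the contents of bag select the words considered for each example.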

4.
Set the relationDefinition parameter to attr/bow.xml, then run AlvisNLP/ML and look for the results in output/stfilter.eval.
Do the same for attr/vici.xml and compare the results.

5.
Change the following parameters:

fields in common.extract.words to lemma
feature in attr/bow.xml to lemma

6.
Do your homework.

Using dependencies

1. and 2.
attachment:common_ccg.plan

3.
The following command runs the plan normally and additionally dumps the corpus state to the output/parsed.dmp file.
alvisnlp -dumpModule common.parse output/parsed.dmp stfilter.plan

This corpus state can be used in subsequent invocations of AlvisNLP/ML:
alvisnlp -resume output/parsed.dmp stfilter.plan
Notice that all modules up to common.parse are skipped. This shortens the experiment cycle, but it is only valid if the plan does not change before common.parse.

Command-line options can be listed by typing: alvisnlp -help
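The dump-then-resume cycle can be wrapped in a small helper. The Python sketch below only builds the command line, using the two options shown above (-dumpModule and -resume); the helper name build_command is hypothetical, and actually running the command (for instance with subprocess.run) is left to the caller.

```python
import os


def build_command(plan, dump_path, dump_module):
    """Return the alvisnlp command line as a list of arguments.

    If a dump already exists, resume from it; otherwise run the whole
    plan and dump the corpus state after dump_module.
    """
    if os.path.exists(dump_path):
        return ["alvisnlp", "-resume", dump_path, plan]
    return ["alvisnlp", "-dumpModule", dump_module, dump_path, plan]


print(" ".join(build_command("stfilter.plan", "output/parsed.dmp", "common.parse")))
```

Remember to delete the dump file whenever the plan changes before common.parse, otherwise the resumed run would use a stale corpus state.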

4.
The file attachment:vicidep.xml uses as its bag the words linked to a gene name by a dependency.

Expressions can be tricky; see Element Expression and Element Expression Examples.
