This page is obsolete, please go to https://github.com/Bibliome/alvisnlp/wiki/Element-classifier-recipes¶
Element Classifiers Recipes and Questions¶
This page provides pragmatic insights about the generic Weka wrapper modules. For a complete description of the parameters, refer to the reference documentation of TrainingElementClassifier, TaggingElementClassifier and SelectingElementClassifier.
Overview¶
- deciding the target elements
- writing a relation definition
- selecting attributes with SelectingElementClassifier
- training a classifier on training elements with TrainingElementClassifier
- using the classifier to tag elements with TaggingElementClassifier
Target elements¶
The target elements are the elements you want to classify. They are specified by the example parameter in each of the three modules. This parameter is an Element Expression evaluated as a list of elements with the corpus as the context element. The resulting collection of elements is the training set in TrainingElementClassifier and SelectingElementClassifier, or the set of elements whose class is predicted in TaggingElementClassifier.
Target examples¶
Documents¶
documents
To restrict the target to only some documents, for instance the training set:
documents(set == "train")
This assumes that documents have a feature with key set and an appropriate value. For instance, this feature could have been added by the reader module that loaded some files into the corpus.
Annotations¶
documents.sections.layer:sentences
This assumes a layer named sentences that contains annotations representing sentences. For instance, this layer could have been filled with SeSMig.
To restrict the target to only some sentences, for instance those that contain at least two gene names:
documents.sections.layer:sentences(inside:genes >= 2)
This assumes a layer named genes containing all gene names acquired from previous modules.
For NER tasks, you may want to classify annotation n-grams. In that case, use the NGrams module:
<module id="tokenize" class="OgmiosTokenizer">
<param name="tokenTypeFeature">type</param>
<param name="separatorTokens">false</param>
<param name="targetLayerName">tokens</param>
</module>
<module id="ngrams" class="NGrams">
<param name="targetLayerName">ngrams</param>
<param name="tokenLayerName">tokens</param>
<param name="maxNGramSize">3</param>
</module>
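To give an idea of how many candidate elements such a configuration produces, here is a minimal Python sketch. It is not AlvisNLP/ML code: the function name ngrams_up_to is hypothetical, and it assumes that the n-grams are simply all contiguous token runs of length 1 to the maximum size, ignoring any boundary handling the actual module may perform.

```python
def ngrams_up_to(tokens, max_size):
    """Return every contiguous run of 1..max_size tokens, ordered by start position."""
    result = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            end = start + size
            if end > len(tokens):
                # No longer run can start at this position.
                break
            result.append(tuple(tokens[start:end]))
    return result


if __name__ == "__main__":
    for ngram in ngrams_up_to(["sigma", "factor", "B", "binds"], 3):
        print(" ".join(ngram))
```

Under this assumption the number of candidates grows roughly as the number of tokens times the maximum size, which is worth keeping in mind when choosing maxNGramSize.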
Tuples¶
Why not?
documents.sections.relations:genePairs.tuples
This assumes a relation named genePairs. Note that all gene name pairs in a sentence can be generated with the module CartesianProductTuples like this:
<module id="genePairs" class="CartesianProductTuples">
<param name="anchor">documents.sections.layer:sentences</param>
<param name="relationName">genePairs</param>
<param name="arguments">
<first>inside:genes</first>
<second>inside:genes</second>
</param>
</module>
Of course, you need to adjust the target so that your classifier does not attempt to classify pairs of the same gene:
documents.sections.relations:genePairs.tuples(args:first != args:second)
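The effect of the Cartesian product followed by this filter can be pictured with a small Python sketch. This is an illustration only, not the module's implementation: the function name candidate_pairs is hypothetical, and gene mentions are compared by position in the sentence, which is an assumption about how distinct annotations are told apart.

```python
from itertools import product


def candidate_pairs(genes):
    """All ordered (first, second) pairs of distinct gene mentions in one sentence."""
    indexed = list(enumerate(genes))
    # Compare positions rather than surface forms, so that two mentions of the
    # same gene name at different places still form a pair.
    return [(a, b) for (i, a), (j, b) in product(indexed, repeat=2) if i != j]


if __name__ == "__main__":
    print(candidate_pairs(["sigA", "sigB", "spoIIID"]))
```

Note that the pairs are ordered: a sentence with n gene mentions yields n × (n − 1) examples, each pair appearing once in each direction.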
Relations¶
AlvisNLP/ML indulges the likes of you:
documents.sections.relations:myrelation
This assumes a relation named myrelation and that it is effectively what you wish to classify.
Relation definition¶
Here, relation is used in the Weka sense; it does not refer to AlvisNLP/ML relations.
The relation definition is specified by the relationDefinition parameter in the three modules:
<param name="relationDefinition">
<relation name="myrelation">
attribute and bag definitions
</relation>
</param>
However, we recommend placing the relation subtree in a separate file and invoking it like this:
<param name="relationDefinition" load="myfile">
<relation name="myrelation">
attribute and bag definitions
</relation>
</param>
Indeed, it is important to use the same relation definition in all three modules.
The relation name is optional and doesn't actually make a difference at all.
Attributes¶
Each attribute is specified with an attribute tag:
<attribute
name="NAME"
type="TYPE"
class="CLASS">
EXPR
</attribute>
NAME is the name of the attribute. It is mandatory and must be unique in the relation.
TYPE is the type of the attribute and can take one of three values: bool, int or nominal. If the type is omitted, it defaults to bool. If the type is nominal, then the attribute definition must also specify all possible values:
<attribute
name="NAME"
type="nominal"
class="CLASS"
value="EXPR">
<value>value1</value>
<value>value2</value>
...
</attribute>
Note the alternative way to specify EXPR, through the value attribute.
CLASS is a boolean (allowed values: true, false, yes and no). It indicates whether the attribute is the class attribute, that is, the attribute predicted by the classifier. If omitted, the attribute is not the class attribute. There must be exactly one class attribute in the relation definition.
EXPR is an expression that specifies the value of the attribute for a given example element. To compute the value of the attribute for a given element, AlvisNLP/ML evaluates EXPR with the element as the context element. The type of the evaluation depends on the type of the attribute:
Attribute type | Evaluation type
bool | boolean
int | number
nominal | string
If a nominal value evaluates to a string different from all declared possible values then AlvisNLP/ML will issue an error.
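The type rules above can be summarized with a short Python sketch. It only mirrors the behaviour described on this page and is not taken from AlvisNLP/ML: the function name check_attribute_value and the use of ValueError are hypothetical choices made for the illustration.

```python
def check_attribute_value(attr_type, raw, allowed=None):
    """Convert a raw evaluation result according to the attribute type.

    bool -> boolean, int -> number, nominal -> string that must be one of
    the declared values.
    """
    if attr_type == "bool":
        return bool(raw)
    if attr_type == "int":
        return int(raw)
    if attr_type == "nominal":
        value = str(raw)
        if allowed is None or value not in allowed:
            # Mirrors the error described above for undeclared nominal values.
            raise ValueError("undeclared nominal value: " + value)
        return value
    raise ValueError("unknown attribute type: " + attr_type)


if __name__ == "__main__":
    print(check_attribute_value("nominal", "N", ["N", "V", "J"]))
    print(check_attribute_value("int", 12))
```

In practice this means that the list of value tags of a nominal attribute must cover everything EXPR can return on both the training and the tagging corpus.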
Attribute Examples¶
All-uppercase word¶
<attribute name="allcaps" type="bool">@form =~ "^[A-Z]+$"</attribute>
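When writing such anchored patterns, keep in mind that a character class without a quantifier matches exactly one character. The Python sketch below shows the difference; it assumes that the expression language follows the usual regular expression semantics for anchors and quantifiers, and the function name is_all_caps is hypothetical.

```python
import re


def is_all_caps(form):
    """True when the form consists only of one or more ASCII uppercase letters."""
    return re.search(r"^[A-Z]+$", form) is not None


if __name__ == "__main__":
    for form in ["DNA", "Dna", "A", ""]:
        print(repr(form), is_all_caps(form))
    # Without the quantifier, only single-letter forms match.
    print(re.search(r"^[A-Z]$", "DNA") is not None)
```

The pattern only covers ASCII letters; accented or non-Latin uppercase forms would need a different character class.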
Number of words in sentence¶
<attribute name="wordcount" type="int">inside:words</attribute>
To exclude punctuation from the count:
<attribute name="wordcount" type="int">inside:words[@type != "punctuation"]</attribute>
This assumes that words have a feature type indicating the word type (see WoSMig annotationTypeFeature parameter).
POS category of word¶
<attribute name="pos" type="nominal" value='@pos =~ "^."'>
<value>N</value>
<value>V</value>
<value>J</value>
<value>R</value>
<value>D</value>
</attribute>
Bags¶
Bags are attribute generators mainly used to emulate bag-of-words representations.
<bag
prefix="PREFIX"
key="KEY"
count="COUNT"
loadValues="FILE">
EXPR
</bag>
PREFIX is the prefix of all generated attribute names. It is mandatory; choose it wisely so that it does not clash with the names of other attributes.
KEY is a feature name.
COUNT is a boolean value that specifies the type of the generated attributes:
COUNT | Generated attribute type | Meaning
false | boolean | presence
true | number | count
FILE is the path to a file containing all forms of the bag. It is a UTF-8 encoded file with one value per line. AlvisNLP/ML generates one attribute for each entry.
EXPR is an expression evaluated as a list of elements with the example as the context element. For each element in the result, the value of feature KEY sets or increments the corresponding attribute (depending on COUNT).
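The way a bag expands into attributes can be sketched in Python as follows. This is an illustration of the description above, not AlvisNLP/ML code: the function name expand_bag is hypothetical, the vocabulary stands for the entries of FILE, and silently ignoring values that are absent from the vocabulary is an assumption.

```python
from collections import Counter


def expand_bag(prefix, vocabulary, values, count=False):
    """Build one attribute per vocabulary entry from the KEY values of the elements.

    With count=False each attribute is a presence boolean; with count=True it is
    the number of occurrences of the entry.
    """
    vocab_set = set(vocabulary)
    # Assumption: values that are not listed in the vocabulary are ignored.
    seen = Counter(v for v in values if v in vocab_set)
    if count:
        return {prefix + entry: seen[entry] for entry in vocabulary}
    return {prefix + entry: entry in seen for entry in vocabulary}


if __name__ == "__main__":
    vocabulary = ["gene", "bind", "promoter"]
    lemmas = ["gene", "bind", "gene", "the"]
    print(expand_bag("w__", vocabulary, lemmas, count=True))
    print(expand_bag("w__", vocabulary, lemmas))
```

Every example gets the same set of attributes, one per vocabulary entry, which is why the prefix must not collide with the names of attributes declared elsewhere in the relation.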
Bag examples¶
Document word vector¶
<bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">sections.layer:words</bag>
You may generate words.txt with ExpressionExtract:
<module id="words.txt" class="ExpressionExtract">
<target value="documents.sections.layer:words"/>
<fields value="lemma"/>
<outFile value="words.txt"/>
</module>
Syntactic dependency argument¶
<bag prefix="syn__" key="lemma" loadValues="words.txt">tuple:dependencies:head.args:dependent</bag>