This page is obsolete; please go to https://github.com/Bibliome/alvisnlp/wiki/Element-classifier-recipes

Element Classifiers Recipes and Questions

This page provides pragmatic insights about the generic Weka wrapper modules. For a complete description of their parameters, refer to the reference documentation of TrainingElementClassifier, TaggingElementClassifier and SelectingElementClassifier.

Overview

  1. decide the target elements
  2. write a relation definition
  3. select attributes with SelectingElementClassifier
  4. train a classifier on training elements with TrainingElementClassifier
  5. use the classifier to tag elements with TaggingElementClassifier

Target elements

The target elements are the elements you want to classify. They are specified by the example parameter in each of the three modules. This parameter is an Element Expression evaluated as a list of elements with the corpus as the context element. The resulting collection of elements is the training set in TrainingElementClassifier and SelectingElementClassifier, or the set of elements whose class is predicted in TaggingElementClassifier.

Target examples

Documents

  documents

To restrict the target to only some documents, for instance the training set:

  documents(set == "train")

This assumes that documents have a feature with key set and an appropriate value. For instance, this feature could have been added by the reader module that loaded some files into the corpus.

Annotations

  documents.sections.layer:sentences

This assumes a layer named sentences that contains annotations representing sentences. For instance, this layer could have been filled by SeSMig.

To restrict the target to only some sentences, for instance those that contain at least two gene names:

  documents.sections.layer:sentences(inside:genes >= 2)

This assumes a layer named genes containing all gene names acquired from previous modules.

For NER tasks, you may want to classify annotation n-grams. In that case, generate them with the NGrams module:

  <module id="tokenize" class="OgmiosTokenizer">
    <param name="tokenTypeFeature">type</param>
    <param name="separatorTokens">false</param>
    <param name="targetLayerName">tokens</param>
  </module>

  <module id="ngrams" class="NGrams">
    <param name="targetLayerName">ngrams</param>
    <param name="tokenLayerName">tokens</param>
    <param name="maxNGramSize">3</param>
  </module>
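
With these two modules in the plan, the n-grams can become the target elements in the same way as the sentences above. A minimal sketch, assuming the n-gram annotations land in the layer named by targetLayerName and that the target parameter is named example as described on this page:

  <!-- assumption: the NGrams module above stored its annotations in the layer "ngrams" -->
  <param name="example">documents.sections.layer:ngrams</param>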

Tuples

Why not?

  documents.sections.relations:genePairs.tuples

This assumes a relation named genePairs. Note that all gene name pairs in a sentence can be generated with the module CartesianProductTuples like this:

  <module id="genePairs" class="CartesianProductTuples">
    <param name="anchor">documents.sections.layer:sentences</param>
    <param name="relationName">genePairs</param>
    <param name="arguments">
      <first>inside:genes</first>
      <second>inside:genes</second>
    </param>
  </module>

Of course, you need to adjust the target so that your classifier does not attempt to classify pairs of the same gene:

  documents.sections.relations:genePairs.tuples(args:first != args:second)

Relations

AlvisNLP/ML indulges the likes of you:

  documents.sections.relations:myrelation

This assumes a relation named myrelation, and that this relation is actually what you wish to classify.

Relation definition

Here, relation is used in the Weka sense; it does not refer to AlvisNLP/ML relations.

The relation definition is specified by the relationDefinition parameter in the three modules:

  <param name="relationDefinition">
    <relation name="myrelation">
      attribute and bag definitions
    </relation>
  </param>

However, we recommend placing the relation subtree in a separate file and referencing it like this:

  <param name="relationDefinition" load="myfile"/>

where myfile contains the relation subtree:

  <relation name="myrelation">
    attribute and bag definitions
  </relation>

Indeed, it is important to use the same relation definition in all three modules.
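
One way to guarantee this is to have the three modules load the same file. The following is only a sketch: the module ids are hypothetical, the example parameter name follows the description on this page, the load attribute is assumed to stand in for the inline definition, and the other parameters each module requires (see the reference documentation) are left out.

  <!-- hypothetical ids; parameters other than example and relationDefinition are omitted -->
  <module id="select" class="SelectingElementClassifier">
    <param name="example">documents(set == "train")</param>
    <param name="relationDefinition" load="myfile"/>
  </module>

  <module id="train" class="TrainingElementClassifier">
    <param name="example">documents(set == "train")</param>
    <param name="relationDefinition" load="myfile"/>
  </module>

  <module id="tag" class="TaggingElementClassifier">
    <!-- assumption: documents outside the training set are the ones to tag -->
    <param name="example">documents(set != "train")</param>
    <param name="relationDefinition" load="myfile"/>
  </module>

Keeping the definition in one file means that a change to an attribute or a bag reaches selection, training and tagging at once.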

The relation name is optional and makes no difference.

Attributes

Each attribute is specified with an attribute tag:

  <attribute
    name="NAME" 
    type="TYPE" 
    class="CLASS">
    EXPR
  </attribute>
  • NAME is the name of the attribute; it is mandatory and must be unique in the relation.
  • TYPE is the type of the attribute and takes one of three values: bool, int or nominal. If the type is omitted, it defaults to bool. If the type is nominal, the attribute definition must also specify all possible values:
  <attribute
    name="NAME" 
    type="nominal" 
    class="CLASS" 
    value="EXPR">
    <value>value1</value>
    <value>value2</value>
    ...
  </attribute>

Note the alternative way to specify EXPR.

  • CLASS is a boolean (allowed values: true, false, yes and no) that indicates whether the attribute is the class attribute, that is, the attribute predicted by the classifier. If omitted, the attribute is not the class attribute. There must be exactly one class attribute in the relation definition.
  • EXPR is an expression that specifies the value of the attribute for a given example element. To compute the value of the attribute for a given element, AlvisNLP/ML evaluates EXPR with the element as the context element. The type of the evaluation depends on the type of the attribute:
    Attribute type   Evaluation type
    bool             boolean
    int              number
    nominal          string

If a nominal attribute evaluates to a string that is not among the declared values, AlvisNLP/ML issues an error.

Attribute Examples

All-uppercase word

  <attribute name="allcaps" type="bool">@form =~ "^[A-Z]+$"</attribute>

Number of words in sentence

  <attribute name="wordcount" type="int">inside:words</attribute>

To exclude punctuation from the count:

  <attribute name="wordcount" type="int">inside:words[@type != "punctuation"]</attribute>

This assumes that words have a feature type indicating the word type (see WoSMig annotationTypeFeature parameter).

POS category of word

  <attribute name="pos" type="nominal" value='@pos =~ "^."'>
    <value>N</value>
    <value>V</value>
    <value>J</value>
    <value>R</value>
    <value>D</value>
  </attribute>
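
Put together, a relation definition for word targets could look like the sketch below. The class attribute is hypothetical: it assumes that each training word carries a feature named label whose value is either entity or other. Neither the feature nor the values come from AlvisNLP/ML itself.

  <relation name="words">
    <!-- hypothetical class attribute: exactly one attribute has class="yes" -->
    <attribute name="label" type="nominal" class="yes" value="@label">
      <value>entity</value>
      <value>other</value>
    </attribute>
    <!-- type omitted, so this attribute is bool by default -->
    <attribute name="allcaps">@form =~ "^[A-Z]+$"</attribute>
  </relation>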

Bags

Bags are attribute generators mainly used to emulate bag-of-word representations.

  <bag
    prefix="PREFIX" 
    key="KEY" 
    count="COUNT" 
    loadValues="FILE">
    EXPR
  </bag>
  • PREFIX is the prefix of all generated attribute names; it is mandatory. Choose it wisely so that it does not clash with the names of other attributes.
  • KEY is a feature name
  • COUNT is a boolean value that specifies the type of the generated attributes:
    COUNT   Attribute type   Meaning
    false   boolean          presence
    true    number           count
  • FILE is the path to a UTF-8 encoded file containing all values of the bag, one per line. AlvisNLP/ML generates one attribute for each entry.
  • EXPR is an expression evaluated as a list of elements with the example as the context element. For each element in the result, the value of feature KEY sets or increments the corresponding attribute (depending on COUNT).

Bag examples

Document word vector

  <bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">sections.layer:words</bag>

You may generate words.txt with ExpressionExtract:

  <module id="words.txt" class="ExpressionExtract">
    <target value="documents.sections.layer:words"/>
    <fields value="lemma"/>
    <outFile value="words.txt"/>
  </module>

Syntactic dependency argument

  <bag prefix="syn__" key="lemma" loadValues="words.txt">tuple:dependencies:head.args:dependent</bag>
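
Bags sit next to plain attributes in the relation definition. A minimal sketch for document targets, assuming a hypothetical document feature named category with the values relevant and irrelevant as the class, and the words.txt file generated above:

  <relation name="docs">
    <!-- hypothetical class attribute read from a document feature named "category" -->
    <attribute name="category" type="nominal" class="yes" value="@category">
      <value>relevant</value>
      <value>irrelevant</value>
    </attribute>
    <!-- one numeric attribute per line of words.txt, each named with the w__ prefix -->
    <bag prefix="w__" key="lemma" count="yes" loadValues="words.txt">sections.layer:words</bag>
  </relation>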