This page is obsolete, please go to https://github.com/Bibliome/alvisnlp/wiki/Writing-plans¶
Writing Plans (!) WIP¶
The plan file is an XML file that specifies the sequence of modules and their parameters.
The basics¶
The top-level tag of a plan file is alvisnlp-plan:
<alvisnlp-plan id="foo">
  ...
</alvisnlp-plan>
The id attribute is mandatory; its value is the identifier of the plan. Choose a value that is meaningful to you.
The alvisnlp-plan tag then contains several module tags:
<alvisnlp-plan id="foo">
  <module id="read" class="TextFileReader">
    ...
  </module>
  <module id="words" class="WoSMig">
    ...
  </module>
  ...
</alvisnlp-plan>
Each module tag specifies an AlvisNLP/ML module that will process the corpus. The id and class attributes are both mandatory:
- id specifies the identifier of the module in this plan. As with the plan identifier, choose something meaningful. The module identifier must be unique: no two modules in a plan may have the same identifier.
- class specifies the class of the module, that is, what it does to the corpus. The value must be a supported module class.
Finally, each module tag contains tags that specify the parameter values for the module:
<alvisnlp-plan id="foo">
  <module id="read" class="TextFileReader">
    <sourcePath>...</sourcePath>
    <sectionName>contents</sectionName>
  </module>
  <module id="words" class="WoSMig"/>
  ...
</alvisnlp-plan>
In this example, we set two parameters in the module read (sourcePath and sectionName) and no parameter in words. The module class documentation should specify which parameters are supported, which are mandatory, and what the default values are.
Parameter value conversion¶
The type of each parameter, and the way the contents of the tag are converted into that type, are documented in the module class description. This section reviews the most commonly used parameter types:
String¶
The contents of the tag must be a character string. Leading and trailing whitespace is trimmed.
Integer¶
The contents of the tag must be a character string. The conversion is in base 10, with an optional leading sign symbol (+ or -). Leading and trailing whitespace is trimmed.
Boolean¶
The contents of the tag must be a character string.
true values | false values |
true | false |
on | off |
yes | no |
Leading and trailing whitespace is trimmed.
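As a rough illustration of the three conversions above, the following sketch sets one parameter of each type. The module class SomeModule and the parameter names someString, someInteger and someFlag are hypothetical placeholders, not documented AlvisNLP/ML names; check the module class documentation for the real ones.

```xml
<!-- Hypothetical module class and parameter names, for illustration only -->
<module id="example" class="SomeModule">
  <!-- String: leading and trailing whitespace is trimmed -->
  <someString>  contents  </someString>
  <!-- Integer: base 10, with an optional leading sign -->
  <someInteger>-42</someInteger>
  <!-- Boolean: one of true/on/yes or false/off/no -->
  <someFlag>yes</someFlag>
</module>
```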
SourceStream and TargetStream¶
A SourceStream or TargetStream represents a file, a directory or an internet resource. SourceStream values are input resources, while TargetStream values represent outputs. The contents of the tag are either a path in the local filesystem or a URL to a remote resource. The conversion supports the main internet protocols (http, https, ftp) and the standard streams (stdin, stdout, stderr). It also supports the concatenation of several resources, the selection of all files in a directory with filename filters, and compression schemes.
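The two simplest forms, a local path and a remote URL, might look like the sketch below. It assumes that the sourcePath parameter of TextFileReader is of type SourceStream; the module identifiers, the path and the URL are hypothetical. The syntax for concatenation, directory filters, standard streams and compression is not shown because it is not described here.

```xml
<!-- Variant 1: input read from a (hypothetical) path in the local filesystem -->
<module id="readLocal" class="TextFileReader">
  <sourcePath>/path/to/corpus.txt</sourcePath>
</module>

<!-- Variant 2: input read from a (hypothetical) remote resource over https -->
<module id="readRemote" class="TextFileReader">
  <sourcePath>https://example.org/corpus.txt</sourcePath>
</module>
```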
File, InputFile, InputDirectory, OutputDirectory, OutputFile, ExecutableFile¶
These types represent resources in the local filesystem; their values cannot be remote URLs. AlvisNLP/ML checks the values according to the type:
parameter type | file exists | file type | permissions |
InputDirectory | yes | directory | rx |
InputFile | yes | regular | r |
OutputDirectory | yes 1 | directory 1 | rwx 1 |
OutputFile | yes 1 | directory or regular 2 | rwx 2 |
ExecutableFile | yes | regular | rx |
File | no | regular 3 | |
1 some ancestor must exist and be a writable directory
2 if the file exists then it must be regular and writable; the innermost existing ancestor must be a writable directory
3 if the file exists then it must be regular
Expression¶
Expression parameters are evaluated only while the module is processing. Expressions make it possible to set values that depend on the state of the corpus. The value of the tag must follow the expression language syntax.
Arrays¶
Array parameter types can be recognized by a pair of brackets at the end of the type ([]). An array value is a sequence of values of the same type. You can either specify the values separated by commas (,), or place each value inside a nested tag (of arbitrary name).
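The two notations could be written as in the sketch below. The parameter name someArray and the nested tag name item are hypothetical; according to the description above, any tag name should work for the nested form.

```xml
<!-- Form 1: values separated by commas -->
<someArray>alpha,beta,gamma</someArray>

<!-- Form 2: one nested tag per value (the tag name "item" is arbitrary) -->
<someArray>
  <item>alpha</item>
  <item>beta</item>
  <item>gamma</item>
</someArray>
```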
Mappings¶
Mapping parameter types map strings to values, where all values of the mapping are of the same type. The mapping can be specified by separating the entries with commas (,) and separating each key from its value with an equal sign (=). Alternatively, each entry can be specified with a tag whose name is the entry key and whose contents are converted into the value.
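A minimal sketch of both notations follows. The parameter name someMapping and the keys and values are hypothetical and only illustrate the shape of the value.

```xml
<!-- Form 1: entries separated by commas, key and value separated by '=' -->
<someMapping>first=1,second=2</someMapping>

<!-- Form 2: one tag per entry; the tag name is the key, the contents are the value -->
<someMapping>
  <first>1</first>
  <second>2</second>
</someMapping>
```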
Sequences¶
Sequences are sub-parts of the plan that contain modules (or other sequences):
<alvisnlp-plan id="foo">
  <module id="read" class="TextFileReader">
    <sourcePath>...</sourcePath>
    <sectionName>contents</sectionName>
  </module>
  <sequence id="segmentation">
    <module id="words" class="WoSMig"/>
    <module id="sentences" class="SeSMig"/>
  </sequence>
  ...
</alvisnlp-plan>
The id attribute is mandatory and must be unique: it cannot be the same as the identifier of a module in the same sequence.
Sequences do not alter the order of processing; their purpose is to organize modules into logical bundles. Note that sequences affect the logging and may make the AlvisNLP/ML log easier to read.
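Since sequences may contain other sequences, a nested layout could look like the following sketch. The outer sequence identifier preprocessing is a hypothetical name; the modules are still processed in document order (read, words, sentences).

```xml
<alvisnlp-plan id="foo">
  <!-- Outer sequence (hypothetical identifier) bundling reading and segmentation -->
  <sequence id="preprocessing">
    <module id="read" class="TextFileReader">
      <sourcePath>...</sourcePath>
      <sectionName>contents</sectionName>
    </module>
    <!-- Inner sequence nested in the outer one -->
    <sequence id="segmentation">
      <module id="words" class="WoSMig"/>
      <module id="sentences" class="SeSMig"/>
    </sequence>
  </sequence>
</alvisnlp-plan>
```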
Plan import¶
Plans can be reused inside other plans:
<alvisnlp-plan id="foo">
  <module id="read" class="TextFileReader">
    <sourcePath>...</sourcePath>
    <sectionName>contents</sectionName>
  </module>
  <import file="/path/to/another/plan.xml"/>
  ...
</alvisnlp-plan>
In this example, all the modules specified in /path/to/another/plan.xml will process the corpus as if the contents of that plan file had been included at this point.
Plan-level parameters¶
You can define parameters for the whole plan, so that they can be set when the plan is imported.
A plan-level parameter looks like this in project_species.xml:
<param name="speciesFile">
  <alias module="project.species" param="dictFile" />
</param>
<sequence id="project">
  <module id="species" class="SimpleContentsProjector">
    <!-- ... snip ... -->
  </module>
</sequence>
An import of this file could look like:
<import file="project_species.xml">
  <param name="speciesFile" value="/bibdev/resources/..."/>
</import>
This will import the plan specified in project_species.xml and set the parameter speciesFile. Since speciesFile is defined as an alias for the dictFile parameter in project.species, that parameter will be set to the given value.
A plan-level parameter can be an alias for several parameters in different modules. When such a parameter is set on import, all the aliased parameters receive the same value. You have no excuses left: make modular plans!
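A plan-level parameter with several aliases might be declared as in the sketch below. This is an assumption based on the description above: the document does not show the exact syntax for multiple aliases, and the module project.otherModule and its parameter someFile are hypothetical.

```xml
<!-- Sketch only: nesting several alias tags in one param is an assumption -->
<param name="speciesFile">
  <!-- Alias taken from the example above -->
  <alias module="project.species" param="dictFile"/>
  <!-- Hypothetical second module and parameter receiving the same value -->
  <alias module="project.otherModule" param="someFile"/>
</param>
```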
More on parameters¶
The parameter tag may have attributes that change the conversion:
inhibitCheck="true" | prevents AlvisNLP/ML from checking this parameter value; for instance, it will not check for the existence of InputFile parameters. |
separator="C" | sets the separator character between array elements or mapping entries (default: ,). |
qualifier="C" | sets the separator character between the key and the value of a mapping entry (default: =). |
trim="false" | prevents AlvisNLP/ML from trimming leading and trailing whitespace off the parameter value. |
load="..." | loads the specified file. This file must be an XML file; AlvisNLP/ML sets the parameter value as if the parameter tag were the root element of this file. This attribute is useful for complex parameter values. |
More attributes may be supported for the conversion to specific types.
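The sketch below shows how these attributes might be combined with the parameter types described earlier. The module class SomeModule and all parameter names are hypothetical placeholders, and the file path is only an example.

```xml
<!-- Hypothetical module class and parameter names, for illustration only -->
<module id="example" class="SomeModule">
  <!-- Array using ';' instead of ',' between elements -->
  <someArray separator=";">first;second;third</someArray>
  <!-- Mapping using ';' between entries and ':' between key and value -->
  <someMapping separator=";" qualifier=":">a:1;b:2</someMapping>
  <!-- Keep the leading and trailing whitespace of the value -->
  <someString trim="false"> padded </someString>
  <!-- Skip the existence check for a file that may not exist yet -->
  <someInputFile inhibitCheck="true">/path/to/file.txt</someInputFile>
</module>
```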
Command-line control¶
The plan, especially parameter values, can be controlled from the command line.
-param¶
alvisnlp -param MODULE PARAM VALUE
The -param option sets the value of parameter PARAM in MODULE. MODULE is the identifier of a module specified in the plan. If the module is inside a sequence, then its identifier has the form SEQUENCE.MODULE.
The VALUE is a string, and it is converted as if it were the contents of the parameter tag.
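To illustrate the SEQUENCE.MODULE form, assume a plan in which the module words sits inside the sequence segmentation, and a hypothetical parameter someParam (not a documented WoSMig parameter). Under this reading, running alvisnlp -param segmentation.words someParam someValue should have the same effect as writing the parameter tag directly in the plan:

```xml
<sequence id="segmentation">
  <module id="words" class="WoSMig">
    <!-- Hypothetical parameter; the contents are converted like the VALUE argument -->
    <someParam>someValue</someParam>
  </module>
  <module id="sentences" class="SeSMig"/>
</sequence>
```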
-xparam¶
alvisnlp -xparam MODULE PARAM XVALUE
The -xparam option behaves the same way as -param, but it expects an XML tag instead of a string value. This option is useful if you want to set parameters with conversion options.
-feat¶
alvisnlp -feat KEY VALUE
The -feat option adds a feature pair to the corpus before the processing starts. Expression parameters can get the value of this feature to alter the behavior of the modules.
-entity¶
alvisnlp -entity NAME REPLACEMENT
This option defines an XML entity that is used when the plan file is parsed.
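As a sketch of how such an entity might be used, assume the option is given as alvisnlp -entity corpusFile /path/to/corpus.txt, where corpusFile is a hypothetical entity name and the path is only an example. The plan can then refer to it with the standard XML entity reference syntax (&NAME;):

```xml
<alvisnlp-plan id="foo">
  <module id="read" class="TextFileReader">
    <!-- &corpusFile; is replaced by the REPLACEMENT given on the command line -->
    <sourcePath>&corpusFile;</sourcePath>
    <sectionName>contents</sectionName>
  </module>
</alvisnlp-plan>
```

This keeps machine-specific paths out of the plan file, so the same plan can be run in different environments.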
-environmentEntities¶
alvisnlp -environmentEntities
This option defines an XML entity for each environment variable.