This is a tool for making event files of feature forest models.
forestmaker model_name grammar_module derivbank event_file | |
model_name | name of a probabilistic model (this will be used in parsing) |
grammar_module | lilfes program in which a grammar and predicates for event extraction are implemented |
derivbank | derivbank obtained by "lexextract" (lildb format) |
event_file | file to output unfiltered events (text format or compressed (gz or bz) format) |
Options | |
-r file_name | file to output reference distribution |
-n threshold | limit number of events to be output |
-v | print debug messages |
-vv | print many debug messages |
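As a rough illustration of how the arguments and options above fit together, the following Python sketch assembles a "forestmaker" command line. The helper name forestmaker_command and all file names are hypothetical, and placing the options before the positional arguments is an assumption that the synopsis above does not confirm.

```python
def forestmaker_command(model_name, grammar_module, derivbank, event_file,
                        reference_file=None, limit=None, verbosity=0):
    """Build an argument list for forestmaker from the documented arguments."""
    if verbosity not in (0, 1, 2):
        raise ValueError("verbosity must be 0 (quiet), 1 (-v), or 2 (-vv)")
    command = ["forestmaker"]
    if reference_file is not None:
        command += ["-r", reference_file]   # file to output the reference distribution
    if limit is not None:
        command += ["-n", str(limit)]       # limit on the number of events to output
    if verbosity:
        command.append("-" + "v" * verbosity)
    # Assumption: options come before the four positional arguments.
    command += [model_name, grammar_module, derivbank, event_file]
    return command


# Hypothetical model and file names, for illustration only.
print(" ".join(forestmaker_command("mymodel", "mygrammar.lil",
                                   "derivbank.lildb", "events.gz",
                                   reference_file="reference.out",
                                   verbosity=1)))
```

The resulting list can be passed to subprocess.run once the names are replaced with real ones.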
The name of a probabilistic model must be assigned to each event file. This means that by assigning different names, you can use multiple models in parsing. For example, if you incorporate a unigram model as a reference distribution into a feature forest model, you must assign different names to the models.
This tool supports the construction of a maximum entropy model of a derivation, given a grammar and a derivbank. This tool makes unfiltered events that will be required for the estimation of a probabilistic model.
An unfiltered event is a string that has several fields separated by "//". An example is as follows.
SUBJ//plays//VBZ//[npVPnp]//haag//NNP//[NP]_2//binary
The last field ("binary") denotes the category of this event. The category is used in later steps, for example when applying masks to the events. Events that have the same category name must have the same number of fields, since the same masks are applied to them. This means that you must use different category names for events that have different numbers of fields. For example, binary and unary rule applications are represented with different numbers of fields, so they must be given different category names.
Unfiltered events represent the derivation forest of a sentence in a feature forest format. Because model estimation requires derivation forests for all sentences in the training data (i.e., the derivbank), the tool parses all sentences and outputs forests of probabilistic events by extracting a probabilistic event for each node in the derivation forests. Hence, this tool requires implementations of the interfaces for parsing and for extracting probabilistic events from derivations.
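The field and category conventions described above can be checked mechanically. The following Python sketch splits a single unfiltered event string on "//" and reports categories whose events disagree on the number of fields. It works on individual event strings only, not on the full feature forest format of an event file. The function names and the second sample event are hypothetical.

```python
from collections import defaultdict

SEPARATOR = "//"


def parse_event(line):
    """Split one unfiltered event string into (category, fields)."""
    parts = line.strip().split(SEPARATOR)
    # The last field is the category; everything before it is a feature field.
    return parts[-1], parts[:-1]


def inconsistent_categories(lines):
    """Return {category: set of field counts} for categories with mixed counts."""
    counts = defaultdict(set)
    for line in lines:
        category, fields = parse_event(line)
        counts[category].add(len(fields))
    return {c: n for c, n in counts.items() if len(n) > 1}


events = [
    "SUBJ//plays//VBZ//[npVPnp]//haag//NNP//[NP]_2//binary",
    "HEAD//plays//VBZ//[npVPnp]//unary",  # hypothetical unary event
]

category, fields = parse_event(events[0])
print(category, len(fields))
print(inconsistent_categories(events))
```

An empty result means that every category uses a single field count, so the same mask can be applied to all events of that category.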
First, in order to parse sentences, you must implement the interfaces defined in "UP" (such as id_schema_binary). For details, see "How to use a grammar" and the manual of UP.
In addition, you must implement the following predicates defined in "mayz/forestmake.lil", which substitute for sentence_to_word_lattice/2 and lexical_entry/2.
fm_derivation_to_word_lattice(+$Derivation, -$WordLattice) | |
$Derivation | derivation |
$WordLattice | word lattice (list of 'extent') |
Make a word lattice from a derivation. |
fm_lexical_entry(+$Lex, -$LexName) | |
$Lex | input word and the name of the template that will be assigned to the word (lex_entry) |
$LexName | LEX_NAME (the second argument of 'lexical_entry/2') |
Provide lexical entries that are assigned to a word. |
The above predicates may be implemented in the same way as "sentence_to_word_lattice/2" and "lexical_entry/2". However, their inputs carry information about the correct derivation, and this information can be exploited. For example, since the input of "fm_lexical_entry/2" includes the name of the correct lexical entry, we can prune lexical entries with low probabilities by returning only the correct lexical entry and the other lexical entries with high probabilities. This technique greatly reduces the time needed for parsing training sentences, and hence for making an event file. Note that the correct lexical entry must always be included in the assigned lexical entries, because a derivation forest must contain the correct derivation tree.
The following predicate must also be implemented to make correct derivation trees. While derivations in a derivbank are used for making correct derivation trees, the following predicate is necessary for providing lexical entries corresponding to terminal nodes.
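The pruning idea described above can be sketched in Python as follows. This is only an illustration of the selection logic, not of how "fm_lexical_entry/2" is written in lilfes. The function name, the keep parameter, and the probability table are all hypothetical.

```python
def prune_lexical_entries(candidates, correct, keep=3):
    """Keep the `keep` most probable lexical entries plus the correct one.

    candidates: dict mapping a lexical entry name to its probability
                (hypothetical scores, e.g. from a simple tagger).
    correct:    name of the correct lexical entry for this word.
    """
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    kept = ranked[:keep]
    # The correct entry must survive pruning; otherwise the derivation
    # forest could not contain the correct derivation tree.
    if correct not in kept:
        kept.append(correct)
    return kept


# Hypothetical candidates for one word.
candidates = {"[npVPnp]": 0.60, "[npVP]": 0.30, "[NP]": 0.08, "[npVPpp]": 0.02}
print(prune_lexical_entries(candidates, correct="[npVPpp]", keep=2))
```

A smaller keep value shortens parsing time further, at the cost of a derivation forest that contains fewer incorrect alternatives for the model to discriminate against.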
fm_correct_lexical_entry(+$Term, -$LexName) | |
$Term | terminal node of a derivation (derivation_terminal) |
$LexName | LEX_NAME (the second argument of lexical_entry/2) |
Returns the correct lexical entry corresponding to a terminal node of a derivation. |
Next, the following interfaces defined in "mayz/amismodel.lil" are required for extracting probabilistic events. They extract an event from each node in a derivation forest. An event is represented as a list of strings. "forestmaker" calls these predicates for each node in a derivation forest, and the results are output into an event file in a feature forest format.
extract_terminal_event(+$ModelName, -$Category, +$LexName, +$Sign, +$SignPlus, -$Event) | |
$ModelName | name of a probabilistic model |
$Category | name of a category |
$LexName | LEX_NAME (the second argument of "lexical_entry/2") |
$Sign | lexical entry |
$SignPlus | SIGN_PLUS (the third argument of "reduce_sign/3") |
$Event | event (a list of strings) |
Extract an event of a terminal node. |
extract_unary_event(+$ModelName, -$Category, +$SchemaName, +$Dtr, +$Mother, +$SignPlus, -$Event) | |
$ModelName | name of a probabilistic model |
$Category | name of a category |
$SchemaName | name of a schema |
$Dtr | daughter sign |
$Mother | mother sign |
$SignPlus | SIGN_PLUS (the third argument of "reduce_sign/3") |
$Event | event (a list of strings) |
Extract an event of a unary rule application. |
extract_binary_event(+$ModelName, -$Category, +$SchemaName, +$LeftDtr, +$RightDtr, +$Mother, +$SignPlus, -$Event) | |
$ModelName | name of a probabilistic model |
$Category | name of a category |
$SchemaName | name of a schema |
$LeftDtr | sign of a left daughter |
$RightDtr | sign of a right daughter |
$Mother | sign of a mother |
$SignPlus | SIGN_PLUS (the third argument of "reduce_sign/3") |
$Event | event (a list of strings) |
Extract an event of a binary rule application. |
extract_root_event(+$ModelName, -$Category, +$Sign, -$Event) | |
$ModelName | name of a probabilistic model |
$Category | name of a category |
$Sign | sign of a root node |
$Event | event (a list of strings) |
Extract an event of a root node. |
The name of a probabilistic model must be the same as the first command-line argument of "forestmaker".
For each interface, we also provide a version in which the value of a feature function (integer or float) can be specified. Add the feature value as the last argument.
"forestmaker" allows for the development of an event file with a reference distribution. Specify the file name of a reference distribution in the "-r" option, and implement the following interfaces.
reference_prob_terminal(+$ModelName, +$LexName, +$Sign, +$SignPlus, -$Prob) | |
$ModelName | name of a probabilistic model |
$LexName | LEX_NAME (the second argument of "lexical_entry/2") |
$Sign | sign of a terminal node |
$SignPlus | SIGN_PLUS (the third argument of "reduce_sign/3") |
$Prob | reference probability of a terminal node |
Returns a reference probability of a terminal node. |
reference_prob_unary(+$ModelName, +$SchemaName, +$Dtr, +$Mother, +$SignPlus, -$Prob) | |
$ModelName | name of a probabilistic model |
$SchemaName | name of a schema |
$Dtr | daughter sign |
$Mother | mother sign |
$SignPlus | SIGN_PLUS (the third argument of "reduce_sign/3") |
$Prob | reference probability |
Returns a reference probability of a unary rule application. |
reference_prob_binary(+$ModelName, +$SchemaName, +$LeftDtr, +$RightDtr, +$Mother, +$SignPlus, -$Prob) | |
$ModelName | name of a probabilistic model |
$SchemaName | name of a schema |
$LeftDtr | sign of a left daughter |
$RightDtr | sign of a right daughter |
$Mother | sign of a mother |
$SignPlus | SIGN_PLUS (the third argument of "reduce_sign/3") |
$Prob | reference probability |
Returns a reference probability of a binary rule application. |
reference_prob_root(+$ModelName, +$Sign, -$Prob) | |
$ModelName | name of a probabilistic model |
$Sign | sign of a root node |
$Prob | reference probability |
Returns a reference probability of a root node. |