lexextract: Tool for making derivation trees and a lexicon
This tool is for making derivation trees and lexical entries from
parse trees.
lexextract [options] lexextract_module treebank derivbank lexicon template lexbank
lexextract_module | LiLFeS program in which inverse schemas and inverse lexical rules are defined
treebank | input treebank (lildb format)
derivbank | file to output a derivbank (lildb format)
lexicon | file to output a lexicon (lildb format)
template | file to output lexical entry templates (lildb format)
lexbank | file to output derivation terminals (lildb format)
Options
-v | print debug messages
-vv | print many debug messages
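As a usage sketch, the command line can be assembled in the documented argument order before the tool is run. The helper name and every file name below are hypothetical; only the order of the positional arguments and the "-v" and "-vv" options are taken from the synopsis above.

```python
def build_lexextract_command(module, treebank, derivbank, lexicon,
                             template, lexbank, verbosity=0):
    """Assemble a lexextract command line in the documented argument order.

    verbosity: 0 for no debug output, 1 for -v, 2 for -vv.
    """
    if verbosity not in (0, 1, 2):
        raise ValueError("verbosity must be 0, 1, or 2")
    command = ["lexextract"]
    if verbosity:
        command.append("-" + "v" * verbosity)
    # Positional arguments follow the options, in the order of the synopsis.
    command += [module, treebank, derivbank, lexicon, template, lexbank]
    return command


# Hypothetical file names, for illustration only.
example_command = build_lexextract_command(
    "mygrammar.lil", "treebank.lildb", "derivbank.lildb",
    "lexicon.lildb", "template.lildb", "lexbank.lildb", verbosity=1)
print(" ".join(example_command))
```

The resulting list can be handed to a process launcher such as Python's subprocess.run once the tool is installed.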
This tool converts a treebank made by "treetrans" into a derivbank
(derivation trees of the target grammar theory). It also extracts
lexical entries from the derivbank.
First, write inverse schemas with the following interfaces, in order
to convert parse trees in an input treebank into derivation trees.
The inverse schemas are applied in the following order.
- Apply "root_constraints/1" to the value of "TREE_NODE\NODE_SIGN\"
  of the root node of a parse tree.
    root_constraints(-$Sign)
      $Sign | sign of the root node
    Unify $Sign with the sign of the root of the derivation tree.
- To "TREE_NODE\NODE_SIGN\" of each node, apply
  "inverse_schema_binary/4" or "inverse_schema_unary/3" in a top-down
  manner. The value of "TREE_NODE\SCHEMA_NAME\" is used as the name of
  the schema. Schemas are applied in depth-first order.
    inverse_schema_binary(+$SchemaName, +$Mother, -$Left, -$Right)
      $SchemaName | schema name
      $Mother | sign of the mother
      $Left | sign of the left daughter
      $Right | sign of the right daughter
    Apply a binary schema to $Mother and obtain the daughter signs.
    inverse_schema_unary(+$SchemaName, +$Mother, -$Dtr)
      $SchemaName | schema name
      $Mother | sign of the mother
      $Dtr | sign of the daughter
    Apply a unary schema to $Mother and obtain the daughter sign.
- After applying inverse schemas to all internal nodes, apply
  "lexical_constraints/2" to "TREE_NODE\NODE_SIGN\" of the terminal
  nodes. Since this is done after all applications of inverse schemas,
  you can use this interface to impose default constraints.
    lexical_constraints(+$Word, -$Sign)
      $Word | feature structure representing a word (the value of "TREE_NODE\WORD\")
      $Sign | sign of a terminal node
    Unify $Sign with the sign of a terminal node.
A derivation tree made by the above process is represented with a
feature structure defined in "derivtypes.lil". A list of terminal
nodes is stored in "lexbank".
Next, from the terminal nodes of derivation trees, extract lexical
entry templates and mappings from words to lexical entry templates.
Interfaces for lexicon extraction are defined in "lexextract.lil".
The extraction algorithm is presented below.
In each of the following steps, the target feature structures are
copied. This means that even when you modify the target feature
structures with new constraints or destructive operations, the
modifications will affect neither the derivation trees nor other
lexical entries.
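The effect of this copying can be illustrated with a small Python analogy. It assumes that the copy behaves like a deep copy, which the text above implies but does not state explicitly; the dictionary and its feature names are hypothetical stand-ins for a feature structure.

```python
import copy

# Hypothetical stand-in for the sign of a terminal node.
derivation_sign = {"cat": "VP", "head": {"tense": "past"}}

# An extraction step works on its own copy of the target structure.
working_copy = copy.deepcopy(derivation_sign)
working_copy["head"]["tense"] = "none"   # destructive change on the copy
working_copy["lexeme_only"] = True       # extra constraint on the copy

# The original structure is unchanged.
print(derivation_sign)   # {'cat': 'VP', 'head': {'tense': 'past'}}
```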
- Apply "lexical_entry_template/3" to "DERIV_SIGN\" of each
  terminal node of a derivation tree. The result is stored in
  "LEXENTRY_SIGN\" of the derivation tree.
    lexical_entry_template(+$Word, +$Sign, -$Template)
      $Word | feature structure representing a word
      $Sign | lexical sign
      $Template | lexical entry template
    Make a lexical entry template $Template from the lexical sign
    $Sign of the word $Word.
- Apply "reduce_lexical_template/5" to the lexical entry template,
  and obtain a key for looking up the lexicon, the sign of a lexeme,
  and the history of lexical rule applications. The obtained lexeme
  is stored in the template database. Lexeme signs are also stored in
  "LEXEME_SIGN\" of derivation trees.
    reduce_lexical_template(+$Word, +$InTemplate, -$Key, -$OutTemplate, -$LexRules)
      $Word | feature structure representing a word
      $InTemplate | input lexical entry template (the output of "lexical_entry_template/3")
      $Key | key to look up a lexicon
      $OutTemplate | sign of a lexeme
      $LexRules | a list of applied lexical rules
    Obtain the sign of a lexeme by inversely applying lexical rules
    to a lexical entry template obtained by "lexical_entry_template/3".
- If a lexeme sign is not yet stored in the database, i.e., it is
  seen for the first time, apply "lexeme_name/4" to the lexeme sign to
  obtain the name of the lexeme. The pair of this name and the history
  of lexical rule applications becomes the name of a lexical entry
  template. A mapping from a key to a lexical entry template is
  stored in the lexicon database. Template names are stored in
  "TERM_TEMPLATE\" of derivation trees.
    lexeme_name(+$Word, +$Template, +$ID, -$Name)
      $Word | feature structure representing a word
      $Template | sign of a lexeme
      $ID | identification number (integer)
      $Name | name of a lexeme (string)
    Assign a unique name to a lexeme.
- Increment the occurrence count of the word. The key used for
  counting is obtained with "word_count_key/2". Occurrence counts
  are used for cutting off infrequent words in "lexrefine".
    word_count_key(+$LexKey, -$CountKey)
      $LexKey | key to look up a lexicon
      $CountKey | key to be used for counting a word
    Obtain a key for counting the occurrences of a word. If you want
    to count different keys as the same word, implement this predicate
    so that it returns the same $CountKey for those keys.
Finally, a lexicon and a template database are stored in files.
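The extraction steps above can be summarized in a toy Python model. It is a sketch under simplifying assumptions and not the actual implementation: signs are plain strings, the databases are in-memory dictionaries, the identification number passed to the naming callback is assumed to be a running counter, and all callback names and the toy past-tense rule are hypothetical.

```python
from collections import Counter


def extract_lexicon(terminals, lexical_entry_template,
                    reduce_lexical_template, lexeme_name, word_count_key):
    """Model of the extraction loop over (word, sign) terminal pairs."""
    lexeme_names = {}    # template database: lexeme sign -> lexeme name
    lexicon = {}         # lexicon database: key -> set of template names
    counts = Counter()   # occurrence counts per counting key

    for word, sign in terminals:
        template = lexical_entry_template(word, sign)
        key, lexeme, lex_rules = reduce_lexical_template(word, template)
        if lexeme not in lexeme_names:
            # First occurrence of this lexeme sign: assign a unique name.
            # Assumption: the ID is the number of lexemes seen so far.
            lexeme_names[lexeme] = lexeme_name(word, lexeme,
                                               len(lexeme_names))
        # A template name pairs the lexeme name with the rule history.
        template_name = (lexeme_names[lexeme], tuple(lex_rules))
        lexicon.setdefault(key, set()).add(template_name)
        counts[word_count_key(key)] += 1
    return lexeme_names, lexicon, counts


# Hypothetical toy callbacks; signs are plain strings for brevity.
def toy_template(word, sign):
    return sign


def toy_reduce(word, template):
    # Inversely apply a toy past-tense rule when the sign ends in "+past".
    if template.endswith("+past"):
        return word, template[:-len("+past")], ["past_rule"]
    return word, template, []


def toy_name(word, lexeme, ident):
    return f"{lexeme}_{ident}"


def toy_count_key(key):
    # Count keys that differ only in case as the same word.
    return key.lower()


terminals = [("barked", "verb+past"), ("bark", "verb"), ("Bark", "verb")]
names, lexicon, counts = extract_lexicon(
    terminals, toy_template, toy_reduce, toy_name, toy_count_key)
print(names)
print(lexicon)
print(counts)
```

In this model the keys "bark" and "Bark" remain separate lexicon entries but share one occurrence count, which is the behavior the counting key is meant to allow.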
Tsujii Laboratory
MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)