lexextract: Tool for making derivation trees and a lexicon

This tool is for making derivation trees and lexical entries from parse trees.

lexextract [options] lexextract_module treebank derivbank lexicon template lexbank
lexextract_module	lilfes program in which inverse schemas and inverse lexical rules are defined
treebank	input treebank (lildb format)
derivbank	file to output a derivbank (lildb format)
lexicon	file to output a lexicon (lildb format)
template	file to output lexical entry templates (lildb format)
lexbank	file to output derivation terminals (lildb format)
Options
-v	print debug messages
-vv	print many debug messages
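
As an illustration of the argument order above, the following Python sketch assembles a lexextract command line as an argument list. The helper build_lexextract_command and all file names are hypothetical and not part of MAYZ; only the positional order and the -v/-vv options are taken from the usage above.

```python
def build_lexextract_command(module, treebank, derivbank, lexicon,
                             template, lexbank, verbosity=0):
    """Return the argument list for a lexextract run (hypothetical helper)."""
    if verbosity not in (0, 1, 2):
        raise ValueError("verbosity must be 0, 1 (-v), or 2 (-vv)")
    argv = ["lexextract"]
    if verbosity:
        argv.append("-" + "v" * verbosity)   # options precede the positionals
    # Positional arguments, in the order given in the usage line.
    argv += [module, treebank, derivbank, lexicon, template, lexbank]
    return argv


# All file names below are placeholders chosen for this example.
command = build_lexextract_command(
    "mygrammar.lil", "treebank.lildb", "derivbank.lildb",
    "lexicon.lildb", "template.lildb", "lexbank.lildb", verbosity=1)
print(" ".join(command))
```

The resulting list could be passed to a process launcher such as subprocess.run, assuming lexextract is installed and on the search path.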

This tool converts a treebank made by "treetrans" into a derivbank (derivation trees of the target grammar theory). It also extracts lexical entries from the derivbank.

First, write inverse schemas with the following interfaces, in order to convert parse trees in an input treebank into derivation trees. The inverse schemas are applied in the following order.

Apply "root_constraints/1" to the value of "TREE_NODE\NODE_SIGN\" of the root node of a parse tree.

root_constraints(-$Sign)
$Sign	sign of the root node
Unify $Sign with the sign of the root of the derivation tree.

To "TREE_NODE\NODE_SIGN\" of each node, apply "inverse_schema_binary/4" or "inverse_schema_unary/3" in a top-down manner. The value of "TREE_NODE\SCHEMA_NAME\" is used as the name of the schema. Schemas are applied in depth-first order.

inverse_schema_binary(+$SchemaName, +$Mother, -$Left, -$Right)
$SchemaName	schema name
$Mother	sign of the mother
$Left	sign of the left daughter
$Right	sign of the right daughter
Apply a binary schema to $Mother and obtain daughter signs.

inverse_schema_unary(+$SchemaName, +$Mother, -$Dtr)
$SchemaName	schema name
$Mother	sign of the mother
$Dtr	sign of the daughter
Apply a unary schema to $Mother, and obtain a daughter sign.

After applying inverse schemas to all internal nodes, apply "lexical_constraints/2" to "TREE_NODE\NODE_SIGN\" of each terminal node. Since this is done after all applications of inverse schemas, you can use this interface to impose default constraints.

lexical_constraints(+$Word, -$Sign)
$Word	feature structure representing a word (the value of "TREE_NODE\WORD\")
$Sign	sign of a terminal node
Unify $Sign with the sign of a terminal node.
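
The order of application described above can be modeled with a short Python sketch. It is not how lexextract is implemented: the Node class, the schema names, and the left-to-right visiting of daughters are assumptions made for illustration, and real nodes are feature structures rather than Python objects. The sketch only records which interface would be called, and when.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Simplified parse-tree node (assumption; real nodes are feature structures)."""
    schema_name: Optional[str] = None   # stands in for the SCHEMA_NAME value
    word: Optional[str] = None          # stands in for the WORD value
    children: List["Node"] = field(default_factory=list)


def application_order(root):
    """List the interface calls in the order described in the text."""
    calls = ["root_constraints/1"]      # applied first, to the root sign
    terminals = []

    def visit(node):
        if not node.children:
            terminals.append(node)      # handled only after all inverse schemas
            return
        if len(node.children) == 2:
            calls.append(f"inverse_schema_binary/4 {node.schema_name}")
        elif len(node.children) == 1:
            calls.append(f"inverse_schema_unary/3 {node.schema_name}")
        else:
            raise ValueError("only unary and binary nodes are supported")
        for child in node.children:     # top-down, depth-first traversal
            visit(child)

    visit(root)
    calls += [f"lexical_constraints/2 {t.word}" for t in terminals]
    return calls


# Hypothetical tree: subj_head(she, head_comp(reads, books))
tree = Node("subj_head", children=[
    Node(word="she"),
    Node("head_comp", children=[Node(word="reads"), Node(word="books")]),
])
calls = application_order(tree)
for call in calls:
    print(call)
```

Deferring the terminal nodes until the traversal has finished mirrors the point made above: constraints added by "lexical_constraints/2" see the result of every inverse schema application.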

A derivation tree made by the above process is represented with a feature structure defined in "derivtypes.lil". A list of terminal nodes is stored in "lexbank".

Next, from the terminal nodes of derivation trees, extract lexical entry templates and mappings from words to lexical entry templates. Interfaces for lexicon extraction are defined in "lexextract.lil". The extraction algorithm is presented below. In each of the following steps, the target feature structures are copied. This means that even if you modify the target feature structures with new constraints or destructive operations, the modifications will affect neither the derivation trees nor other lexical entries.
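
The effect of copying can be illustrated with a minimal Python analogy. Nested dictionaries stand in for feature structures here, and the feature names PHON, SYNSEM, and CAT are hypothetical; the sketch only shows that destructive changes to a copy leave the original untouched.

```python
import copy

# A toy terminal sign; real signs are typed feature structures.
derivation_terminal = {"PHON": "reads", "SYNSEM": {"CAT": "verb"}}

# The extraction step is modeled as working on a deep copy of the target.
template = copy.deepcopy(derivation_terminal)
template["SYNSEM"]["CAT"] = "verb_lexeme"   # destructive change to the copy
del template["PHON"]                        # removing a feature from the copy

print(derivation_terminal)   # the sign in the derivation tree is unchanged
print(template)
```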

Apply "lexical_entry_template/3" to "DERIV_SIGN\" of each terminal node of a derivation tree. The result is stored in "LEXENTRY_SIGN\" of the derivation tree.

lexical_entry_template(+$Word, +$Sign, -$Template)
$Word	feature structure representing a word
$Sign	lexical sign
$Template	lexical entry template
Make a lexical entry template $Template from lexical sign $Sign of the word $Word.

Apply "reduce_lexical_template/5" to the lexical entry template, and obtain a key to look up a lexicon, a sign of a lexeme, and a history of lexical rule applications. The obtained lexeme will be stored in a template database. Lexeme signs are also stored in "LEXEME_SIGN\" of derivation trees.

reduce_lexical_template(+$Word, +$InTemplate, -$Key, -$OutTemplate, -$LexRules)
$Word	feature structure representing a word
$InTemplate	input lexical entry template (the output of "lexical_entry_template/3")
$Key	key to look up a lexicon
$OutTemplate	sign of a lexeme
$LexRules	a list of applied lexical rules
Obtain the sign of a lexeme by inversely applying lexical rules to a lexical entry template obtained by "lexical_entry_template/3".

If a lexeme sign is not yet stored in the database, i.e., it is seen for the first time, apply "lexeme_name/4" to the lexeme sign to obtain the name of the lexeme. The pair of this name and the history of lexical rule applications becomes the name of a lexical entry template. A mapping from a key to a lexical entry template is stored in a lexicon database. Template names are stored in "TERM_TEMPLATE\" of derivation trees.

lexeme_name(+$Word, +$Template, +$ID, -$Name)
$Word	feature structure representing a word
$Template	sign of a lexeme
$ID	identification number (integer)
$Name	name of a lexeme (string)
Assign a unique name to a lexeme.

Increment the occurrence count of a word. The key used for counting is obtained by applying "word_count_key/2" to the lexicon key. Occurrence counts will be used for cutting off infrequent words in "lexrefine".

word_count_key(+$LexKey, -$CountKey)
$LexKey	key to look up a lexicon
$CountKey	key to be used for counting a word
Obtain a key for counting the occurrences of a word. If you want to count different keys as the same word, implement this predicate to return the same $CountKey for those keys.
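
The bookkeeping in the steps above can be sketched in Python as follows. This is a simplified model under several assumptions, not the actual implementation: lexeme signs are represented by plain strings, the inputs are assumed to be the outputs of "reduce_lexical_template/5", the ID passed to the naming function is taken to be the number of lexemes stored so far, and the naming scheme, the case-folding count key, and all rule names are hypothetical.

```python
from collections import Counter


def lexeme_name(word, lexeme, ident):
    """Hypothetical naming scheme: the lexeme label plus an ID number."""
    return f"{lexeme}_{ident}"


def word_count_key(lex_key):
    """Assumed policy: keys differing only in case count as the same word."""
    return lex_key.lower()


def extract(terminals):
    """terminals: (word, key, lexeme, lex_rules) tuples, standing in for the
    results of reduce_lexical_template/5 on each terminal node."""
    template_db = {}     # lexeme sign -> lexeme name
    lexicon = {}         # lexicon key -> set of template names
    counts = Counter()   # count key -> number of occurrences
    for word, key, lexeme, lex_rules in terminals:
        if lexeme not in template_db:          # lexeme seen for the first time
            template_db[lexeme] = lexeme_name(word, lexeme, len(template_db))
        # A template name pairs the lexeme name with the lexical rule history.
        template_name = (template_db[lexeme], tuple(lex_rules))
        lexicon.setdefault(key, set()).add(template_name)
        counts[word_count_key(key)] += 1
    return template_db, lexicon, counts


template_db, lexicon, counts = extract([
    ("walked", "walk", "intrans_verb", ["past_tense_rule"]),
    ("Walk", "Walk", "intrans_verb", []),
    ("dogs", "dog", "common_noun", ["plural_rule"]),
])
print(template_db)
print(lexicon)
print(counts)
```

In this toy run the keys "walk" and "Walk" remain separate entries in the lexicon but share one occurrence count, which is the behavior the description of "word_count_key/2" allows for.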

Finally, the lexicon and the template database are written to files.

MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)
