The next step in the development cycle of the Enju grammar is the acquisition of a lexicon and a template database from the phrase structure trees transformed in the previous step. The lexicon is a mapping from lexical information (of type 'word') to names of lexical entry templates (of type 'lex_entry'). The template database is a mapping from names of lexical entry templates to the feature structures of those templates (of type 'hpsg_word').
More precisely, this step first computes an HPSG derivation for each phrase, taking the transformed phrase structure tree as input. Next, a lexicon and a template database are acquired by extracting entries from the leaves of these derivations. To refine the lexicon and the template database, low-frequency entries are removed and entries for unknown words are created. The computation of derivations and the acquisition of the lexicon and the template database are handled by the lexextract tool of MAYZ, while the refinement of the lexicon and the template database is handled by the lexrefine tool of MAYZ.
Let us explain how lexextract is used to obtain derivations from transformed phrase structure trees and to extract the template database and the lexicon.
lexextract GRAMMAR_MODULE TREEBANK DERIVBANK LEXICON TEMPLATE LEXBANK

GRAMMAR_MODULE | a LiLFeS program in which the inverse schemas and the inverse lexical rules are defined
TREEBANK       | input treebank (lildb format)
DERIVBANK      | output file of derivations (lildb format)
LEXICON        | output file of the lexicon (lildb format)
TEMPLATE       | output file of the lexical entry templates (lildb format)
LEXBANK        | output file of the terminal lines of derivations (lildb format)
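For example, an invocation could look as follows, where "devel/lexextract.lil" is the grammar module mentioned below and the remaining file names are hypothetical:

  lexextract devel/lexextract.lil treebank.lildb derivbank.lildb lexicon.lildb templates.lildb lexbank.lildb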
The phrase structure trees stored in the input treebank are those obtained by transforming a corpus in the previous step. HPSG-style derivation trees are obtained by applying "devel/lexextract.lil" to the transformed phrase structure trees. The type of HPSG-style derivation trees is defined in the file "derivtypes.lil" of MAYZ. LEXICON and TEMPLATE in the above table refer to the lexicon and the lexical entry template database of Enju. For example, the grammar module defines the inverse schemas and related constraints as follows:
root_constraints($Sign) :-
    $Sign = (SYNSEM\(LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
                                     SPEC\[] & CONJ\[])))).

inverse_schema_binary(subj_head_schema, $Mother, $Left, $Right) :-
    $Left = (SYNSEM\($LeftSynsem &
                     LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
                                     SPEC\[] & CONJ\[])))),
    $Subj = $LeftSynsem,
    $Right = (SYNSEM\(LOCAL\CAT\(HEAD\$Head &
                                 VAL\(SUBJ\[$Subj] & COMPS\[] & SPR\[] &
                                      SPEC\$Spec & CONJ\[])))),
    $Mother = (SYNSEM\(LOCAL\CAT\(HEAD\$Head &
                                  VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
                                       SPEC\$Spec & CONJ\[])))).

lexical_constraints(SURFACE\$Surface,
                    SYNSEM\LOCAL\CAT\HEAD\AUX\copula_be) :-
    auxiliary_be($Surface).
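In this excerpt, root_constraints/1 gives the constraint that the root sign of a derivation must satisfy, inverse_schema_binary/4 computes the daughter signs of the subject-head schema from a mother sign, and lexical_constraints/2 imposes word-specific constraints, here marking surface forms recognized by auxiliary_be/1 as the copula "be".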
Next, we obtain the lexicon and the template database from the feature structures of the leaf nodes of each derivation tree. The following interface predicates are applied to the leaf nodes of the derivation tree, in this order: lexical_entry_template/3, reduce_lexical_template/5, lexeme_name/4, and word_count_key/2. A schematic sketch of this sequence is given below.
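The following is a schematic sketch, not the actual MAYZ source, of how these predicates could be chained for one leaf node; the predicate name process_leaf and the argument order of lexeme_name/4 and word_count_key/2 are assumptions for illustration:

%% schematic sketch (assumed driver; not the actual MAYZ code)
process_leaf($WordInfo, $Sign) :-
    %% 1. make a lexical entry template from the terminal sign
    lexical_entry_template($WordInfo, $Sign, $Template),
    %% 2. reduce the template to a lexeme and the applied lexical rules
    reduce_lexical_template($WordInfo, $Template, $Key, $Lexeme, $Rules),
    %% 3. name the lexeme template (assumed argument order)
    lexeme_name($WordInfo, $Lexeme, $Rules, $TemplateName),
    %% 4. make the key used for counting word frequencies
    word_count_key($WordInfo, $CountKey).
    %% $TemplateName and $CountKey would then be recorded in the
    %% lexicon and the template database.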
These interface predicates are defined in "devel/lextemplate.lil". Let us look at some sample code from this file:
%% make a lexical entry template from the sign of a leaf node
lexical_entry_template($WordInfo, $Sign, $Template) :-
    $Sign = (SYNSEM\(LOCAL\CAT\$Cat &
                     NONLOCAL\$NonLocal)),
    copy($Cat, $OutCat),
    copy($NonLocal, $OutNonLocal),
    $Template = (hpsg_word &
                 SYNSEM\(LOCAL\CAT\($OutCat & VAL\SUBJ\[$Subj]) &
                         NONLOCAL\$OutNonLocal)),
    abstract_subj($Subj).
%% generalize the SYNSEM of the subject by removing word-specific features
abstract_subj($Synsem) :-
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, AGR\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, ADJ\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, AUX\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, TENSE\]).
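The two predicates above thus map a terminal sign to a template: lexical_entry_template/3 copies the CAT and NONLOCAL values of the sign into a fresh 'hpsg_word' feature structure, and abstract_subj/1 generalizes the subject slot by restricting away word-specific head features such as POSTHEAD, AGR, ADJ, AUX, and TENSE, so that words differing only in these subject features can share a template.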
%% reduce a lexical entry template to its lexeme; return the key of the
%% lexicon entry and the list of lexical rules that were applied
reduce_lexical_template($WordInfo, $LexEntry, $Key, $Lexeme, $Rules) :-
    get_sign($LexEntry, $Sign),
    get_lexeme($WordInfo, $LexEntry, $BaseWordInfo, $Lexeme1, [], $Rules1),
    canonical_copy($Lexeme1, $Lexeme),
    $Rules = $Rules1,
    $BaseWordInfo = (BASE\$Base & POS\$POS),
    $Key = (BASE\$Base & POS\$POS).
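Here canonical_copy/2 presumably normalizes the lexeme's feature structure, and the key of a lexicon entry is the pair of the BASE form and the POS of the base word, so that all inflected forms of a word are mapped to the same lexeme.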
%% repeatedly apply inverse lexical rules to strip derivational history
%% (the terminating clause that returns the accumulated lexeme is not
%% shown in this excerpt)
get_lexeme($WordInfo, $InTemplate, $NewWordInfo, $NewTemplate,
           $InRules, $NewRules) :-
    ($InRules = [$Rule1|_] ->
     upper_rule($Rule1, $Rule) ; true),
    inverse_rule_lex($Rule, $WordInfo, $InTemplate, $WordInfo1, $Template1),
    get_lexeme($WordInfo1, $Template1, $NewWordInfo, $NewTemplate,
               [$Rule|$InRules], $NewRules).
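In this excerpt, get_lexeme/6 accumulates the inverse lexical rules applied so far in its fifth argument: inverse_rule_lex/5 undoes one application of a lexical rule, while upper_rule/2, presumably defined elsewhere in the same file, constrains which rule may be undone next given the most recently undone rule.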
Let us now explain how lexrefine is used to refine the lexicon and the template database extracted by lexextract.
lexrefine RULE_MODULE ORIGINAL_LEXICON ORIGINAL_TEMPLATE NEW_LEXICON NEW_TEMPLATE

RULE_MODULE       | a LiLFeS program in which the lexical rules are defined
ORIGINAL_LEXICON  | input lexicon
ORIGINAL_TEMPLATE | input template database
NEW_LEXICON       | refined lexicon
NEW_TEMPLATE      | refined template database
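For example, continuing from the lexextract invocation above (the rule module name "devel/lexrule.lil" and the remaining file names are hypothetical):

  lexrefine devel/lexrule.lil lexicon.lildb templates.lildb refined-lexicon.lildb refined-templates.lildb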
First, the input template database is refined by applying the lexical rules defined in the rule module. Next, the input lexicon is refined: entries corresponding to templates that disappeared during the refinement are deleted from the lexicon, while entries corresponding to templates newly obtained by applying lexical rules are added. Finally, entries for unknown words are added to the lexicon. In Enju, the expansion of templates and of the lexicon is specified by the following two predicates:
expand_lexical_template($InTempName, $InTemplate, $Count, $LexRules,
                        $NewTemplate) :-
    get_variable('*template_expand_threshold*', $Thres),
    ($Count > $Thres ->
     %% frequent templates: expand by applying the ordered lexical rules
     ordered_lexical_rules($LexRules),
     apply_ordered_lexical_rules(_, $InTemplate, $LexRules, _, $Template1),
     get_sign($Template1, $NewTemplate) ;
     %% infrequent templates: keep the original sign without expansion
     $LexRules = [],
     get_sign($InTemplate, $NewTemplate)).
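In other words, lexical rules are applied only to templates whose frequency count exceeds the global threshold *template_expand_threshold*; infrequent templates are carried over unexpanded.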
The expansion of the lexicon then computes, for each original key, the new keys licensed by the applied lexical rules:

expand_lexicon($InKey, $NewTempName, $NewKey) :-
    $InKey = BASE\$Base & POS\$BasePOS,
    $NewTempName = LEXICAL_RULES\$Rules,
    %% derive the possible POSs of the inflected forms from the rules
    rules_to_inflection($BasePOS, $Rules, $POSList),
    member($NewPOS, $POSList),
    %% a new key pairs the base form with each derived POS
    $NewKey = BASE\$Base & POS\$NewPOS.
Finally, lexrefine adds entries for unknown words to the lexicon. Any word whose frequency is lower than a threshold is regarded as an unseen word, and new entries are created for it by using the interface predicate unknown_word_key/2. Details of the processing are as follows.
In Enju, the key of an unseen word is a feature structure of type 'word' in which only the POS feature is specified. Calling the interface predicate unknown_word_key/2 returns, as the key of the unseen word, a feature structure of type 'word' with the same POS value as the original key:
unknown_word_key(POS\$POS, POS\$POS).
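For instance (the base form "foobar" and the POS "NN" here are hypothetical), the behavior of this definition can be illustrated as follows:

%% ?- unknown_word_key(BASE\"foobar" & POS\"NN", $Key).
%% $Key is bound to POS\"NN": everything but the POS value is
%% discarded, so all unseen words with the same POS share one key.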