Lexicon extraction

The next step in the development cycle of the Enju grammar is the acquisition of a lexicon and a template database from the phrase structure trees transformed in the previous step. The dictionary is a mapping between lexical information (of type 'word') and names of lexical entry templates (of type 'lex_entry'). The template database is a mapping between names of lexical entry templates and the feature structures of those templates (of type 'hpsg_word').

More precisely, this step first computes an HPSG derivation for each phrase, using the transformed phrase structure tree as input. Next, a dictionary and a template database are acquired by extracting entries from the leaves of these derivations. To refine the dictionary and the template database, entries with low frequency are removed and entries for unknown words are created. The derivation step and the acquisition of the dictionary and the template database are handled by the lexextract tool of Mayz; the refinement of the dictionary and the template database is handled by the lexrefine tool of Mayz.

Extracting the Template Database and Lexicon

Let us explain how lexextract is used to obtain derivations from transformed phrase structure trees and to extract the template database and the lexicon.

lexextract GRAMMAR_MODULE TREEBANK DERIVBANK LEXICON TEMPLATE LEXBANK

  GRAMMAR_MODULE  a LiLFeS program in which the inverse schemas and inverse lexical rules are defined
  TREEBANK        input treebank (lildb format)
  DERIVBANK       output file of derivations (lildb format)
  LEXICON         output file of the lexicon (lildb format)
  TEMPLATE        output file of the lexical entry templates (lildb format)
  LEXBANK         output file of the terminal lines of derivations (lildb format)

Phrase structure trees in the input treebank are obtained by transforming a corpus. HPSG-style derivation trees are obtained by applying "devel/lexextract.lil" to the transformed phrase structure trees. The type of HPSG-style derivation trees is defined in the file "derivtypes.lil" of Mayz. LEXICON and TEMPLATE in the above table refer to the dictionary and the lexical entry template database of Enju.
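
For example, a typical invocation might look like the following (the lildb file names are placeholders):

lexextract devel/lexextract.lil treebank.lildb derivbank.lildb lexicon.lildb templates.lildb lexbank.lildb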

Transforming phrase structure trees to derivation trees

Let us explain how an input phrase structure tree is transformed into an HPSG-style derivation tree. The actual processing is done by calling the interface predicates root_constraints/1, inverse_schema_binary/4, inverse_schema_unary/3, and lexical_constraints/2, following the algorithm given below:
  1. Apply the constraints defined in root_constraints/1 to the root sign of the derivation tree.
  2. For non-terminals with a unary structure, apply the inverse_schema_unary/3 predicate; for non-terminals with a binary structure, apply the inverse_schema_binary/4 predicate. Applying these predicates to the sign of the mother node yields the sign(s) of its daughter(s).
  3. Apply the constraints defined in lexical_constraints/2 to the leaves. The constraints can vary with the corresponding lexical information (of type 'word').
These predicates are defined in Enju in "devel/invschema.lil". To illustrate:
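% root_constraints/1: the root sign must have an empty MOD feature and
% fully saturated valence features (SUBJ, COMPS, SPR, SPEC, CONJ all empty)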
root_constraints($Sign) :-
    $Sign = (SYNSEM\(LOCAL\CAT\(HEAD\MOD\[] &
				VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
				     SPEC\[] & CONJ\[])))).

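% inverse_schema_binary/4: given the schema name and the sign of the mother,
% compute the signs of the left daughter (the saturated subject) and the
% right daughter (the head, which takes the left daughter's SYNSEM as SUBJ)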
inverse_schema_binary(subj_head_schema,
		      $Mother, $Left, $Right) :-
    $Left = (SYNSEM\($LeftSynsem &
		     LOCAL\CAT\(HEAD\MOD\[] &
				VAL\(SUBJ\[] &
				     COMPS\[] &
				     SPR\[] &
				     SPEC\[] &
				     CONJ\[])))),
    $Subj = $LeftSynsem,
    $Right = (SYNSEM\(LOCAL\CAT\(HEAD\$Head &
				 VAL\(SUBJ\[$Subj] &
				      COMPS\[] &
				      SPR\[] &
				      SPEC\$Spec &
				      CONJ\[])))),
    $Mother = (SYNSEM\(LOCAL\CAT\(HEAD\$Head &
				  VAL\(SUBJ\[] &
				       COMPS\[] &
				       SPR\[] &
				       SPEC\$Spec &
				       CONJ\[])))).

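% lexical_constraints/2: a word whose surface form is an auxiliary "be"
% is constrained to have copula_be as its AUX value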
lexical_constraints(SURFACE\$Surface, 
                    SYNSEM\LOCAL\CAT\HEAD\AUX\copula_be) :-
    auxiliary_be($Surface).
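
For unary structures, inverse_schema_unary/3 plays the analogous role. As a minimal sketch (the schema name here is hypothetical, not an actual Enju schema), a unary inverse schema that simply shares the HEAD feature between mother and daughter could look like:

inverse_schema_unary(hypothetical_unary_schema, $Mother, $Dtr) :-
    $Mother = (SYNSEM\LOCAL\CAT\HEAD\$Head),
    $Dtr = (SYNSEM\LOCAL\CAT\HEAD\$Head).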

Extraction of Lexical Entries and Lexical Entry Templates

Next, we obtain the dictionary and the template database from the feature structures of the leaf nodes of the derivation trees. The interface predicates lexical_entry_template/3, reduce_lexical_template/5, lexeme_name/4, and word_count_key/2 are applied to the leaf nodes of each derivation tree, in the following order:

  1. Create a lexical entry template from the lexical information given at a leaf node and the sign of that leaf, using the lexical_entry_template/3 predicate.
  2. Create a lexeme template from the lexical entry template, using the reduce_lexical_template/5 predicate. The key for storing the lexeme template in the dictionary is obtained in this process.
  3. If the lexeme template created above has not yet been stored in the template database, obtain its name using the lexeme_name/4 predicate. The mapping between the key created in the previous step and the lexeme template is stored in the dictionary.
  4. Obtain the key for counting the frequency of a word from the key used for looking up the word in the dictionary, using word_count_key/2, and increment the count of the obtained key. (Even if different keys are used for looking up a given word in the dictionary, the same counting key is returned for all instances of the word.)

The above interface predicates are defined in "devel/lextemplate.lil". Let us look at some sample code:

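% lexical_entry_template/3: build a lexical entry template from a word's sign
% by copying its CAT and NONLOCAL values and abstracting away the
% subject-specific head features (POSTHEAD, AGR, ADJ, AUX, TENSE)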
lexical_entry_template($WordInfo, $Sign, $Template) :-
    $Sign = (SYNSEM\(LOCAL\CAT\$Cat &
		     NONLOCAL\$NonLocal)),
    copy($Cat, $OutCat),
    copy($NonLocal, $OutNonLocal),
    $Template = (hpsg_word &
		 SYNSEM\(LOCAL\CAT\($OutCat & VAL\SUBJ\[$Subj]) &
			 NONLOCAL\$OutNonLocal)),
    abstract_subj($Subj).

abstract_subj($Synsem) :-
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, AGR\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, ADJ\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, AUX\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, TENSE\]).
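
% reduce_lexical_template/5: reduce a lexical entry template to a lexeme
% template and derive the dictionary key (base form and POS) for storing it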
reduce_lexical_template($WordInfo, $LexEntry, $Key, $Lexeme, $Rules) :-
    get_sign($LexEntry, $Sign),
    get_lexeme($WordInfo, $LexEntry, $BaseWordInfo, $Lexeme1, [], $Rules1),
    canonical_copy($Lexeme1, $Lexeme),
    $Rules = $Rules1,
    $BaseWordInfo = (BASE\$Base & POS\$POS),
    $Key = (BASE\$Base & POS\$POS).

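% get_lexeme/6: repeatedly apply inverse lexical rules to strip inflection
% from the template; a terminating base clause (not shown in this excerpt)
% stops the recursion once no further rule applies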
get_lexeme($WordInfo, $InTemplate, $NewWordInfo, $NewTemplate,
	   $InRules, $NewRules) :-
    ($InRules = [$Rule1|_] ->
     upper_rule($Rule1, $Rule); true),
    inverse_rule_lex($Rule, $WordInfo, $InTemplate, $WordInfo1, $Template1),
    get_lexeme($WordInfo1, $Template1, $NewWordInfo, $NewTemplate,
	       [$Rule|$InRules], $NewRules).
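
lexeme_name/4 and word_count_key/2 are not shown above. As a minimal sketch (not the actual Enju definition), a word_count_key/2 that counts every inflected form of a word under its base form alone could look like:

% hypothetical sketch: count all occurrences of a word under its base form
word_count_key(BASE\$Base, BASE\$Base).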

Refining the dictionary and the template database

Let us explain how lexrefine is used to refine the dictionary and the template database extracted by lexextract.

lexrefine RULE_MODULE ORIG_LEXICON ORIG_TEMPLATE NEW_LEXICON NEW_TEMPLATE

  RULE_MODULE    the module in which lexical rules are defined
  ORIG_LEXICON   input lexicon
  ORIG_TEMPLATE  input template database
  NEW_LEXICON    refined lexicon
  NEW_TEMPLATE   refined template database

First, the input template database is refined by applying the lexical rules defined in the rule module. Next, the input dictionary is refined: entries corresponding to templates that disappeared during refinement are deleted from the dictionary, and entries corresponding to templates obtained by applying lexical rules are added. Finally, entries corresponding to unknown words are added to the dictionary.
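
For example, an invocation might look like this (the file names are placeholders, and "devel/lexrules.lil" is a hypothetical module name):

lexrefine devel/lexrules.lil lexicon.lildb templates.lildb refined-lexicon.lildb refined-templates.lildb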

Refining the Template Database

The template database is refined in the following way:
  1. Template entries with frequencies higher than the threshold specified by the -tf option are stored in the output template database.
  2. If the expand_lexical_template/5 predicate is defined, it is applied to each template stored in the previous step, and the resulting templates are stored in the output database; see the definition below. A new template is assigned the same frequency as the original template.
In Enju, lexical rules are applied to lexeme templates whose frequencies are higher than a threshold, and lexical entry templates are obtained:
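% expand_lexical_template/5: apply the lexical rules only when the lexeme's
% frequency exceeds *template_expand_threshold*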
expand_lexical_template($InTempName, $InTemplate, $Count, $LexRules, $NewTemplate) :-
     get_variable('*template_expand_threshold*', $Thres),
     ($Count > $Thres ->
      ordered_lexical_rules($LexRules),
      apply_ordered_lexical_rules(_, $InTemplate, $LexRules, _, $Template1),
      get_sign($Template1, $NewTemplate);
      $LexRules = [], 
      get_sign($InTemplate, $NewTemplate)).

Refining the Dictionary

This process creates a new dictionary by removing entries for templates deleted during the refinement of the template database and by adding entries for the added templates.
  1. If the lexeme template of an entry in the original dictionary remains in the new template database, the entry is added to the new dictionary.
  2. For each new template created from an entry by expand_lexical_template/5, a key for storing the new template in the new dictionary is obtained with expand_lexicon/3. The frequency of the lexeme key is added to the frequency of the new template's key. The keys in the Enju dictionary are of type 'word' with the BASE feature and the POS feature specified. The keys of lexical entries have their POS value changed by the expand_lexicon/3 predicate; the value of the POS feature is determined by the lexical rule applied when creating the lexical entry. For example, when the lexical entry of a verb is created from a lexeme entry with the rule for the third person singular form, the POS feature of the key of the lexical entry is assigned the value "VBZ". The Enju definition is shown below:
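% expand_lexicon/3: keep the base form, but take the new POS determined by
% the lexical rules recorded in the template name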
expand_lexicon($InKey, $NewTempName, $NewKey) :-
    $InKey = BASE\$Base & POS\$BasePOS,
    $NewTempName = LEXICAL_RULES\$Rules,
    rules_to_inflection($BasePOS, $Rules, $POSList),
    member($NewPOS, $POSList),
    $NewKey = BASE\$Base & POS\$NewPOS.

Adding Unseen Words to the Dictionary

Next, lexrefine adds entries for unknown words to the dictionary. Any word with a frequency lower than a threshold is treated as an unseen word, and new entries are created for it. These new entries are created with the interface predicate unknown_word_key/2. The details of the processing are as follows:

  1. If the new dictionary contains entries whose frequencies are lower than the threshold (specified by the -uwf option), use the unknown_word_key/2 predicate to generate keys for lexical entries of unseen words from the keys of these entries.
  2. Keys of templates whose original keys have frequencies higher than the threshold (specified by the -utf option) and keys of unseen words are added to the new dictionary.
  3. If the frequency of the original key is below the threshold (specified by the -wf option), the entry of the original key is deleted.

The keys of unseen words in Enju are feature structures of type 'word' with the POS feature specified. Calling the interface predicate unknown_word_key/2 returns, as the key of an unseen word, a feature structure of type 'word' with the same POS value as the original key:

unknown_word_key(POS\$POS, POS\$POS).
