The next step in the development cycle of the Enju grammar is the acquisition of a lexicon and a template database from the phrase structure trees transformed in the previous step. The lexicon is a mapping from lexical information (of type 'word') to names of lexical entry templates (of type 'lex_entry'). The template database is a mapping from names of lexical entry templates to the feature structures of those templates (of type 'hpsg_word').
More precisely, this step first computes an HPSG derivation for each phrase, taking the transformed phrase structure tree as input. Next, a lexicon and a template database are acquired by extracting entries from the leaves of these derivations. To refine the lexicon and the template database, low-frequency entries are removed and entries for unknown words are created. The computation of derivations and the acquisition of the lexicon and the template database are handled by the lexextract tool of MAYZ, while the refinement of the lexicon and the template database is handled by the lexrefine tool of MAYZ.
Let us explain how lexextract is used to obtain derivations from transformed phrase structure trees and to extract the template database and the lexicon.
lexextract GRAMMAR_MODULE TREEBANK DERIVBANK LEXICON TEMPLATE LEXBANK

GRAMMAR_MODULE | a LiLFeS program in which the inverse schemas and the inverse lexical rules are defined
TREEBANK       | input treebank (lildb format)
DERIVBANK      | output file of derivations (lildb format)
LEXICON        | output file of the lexicon (lildb format)
TEMPLATE       | output file of the lexical entry templates (lildb format)
LEXBANK        | output file of the terminal lines of derivations (lildb format)
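For example, an invocation could look as follows, where "devel/lexextract.lil" is the grammar module mentioned below and the remaining file names are hypothetical:

  lexextract devel/lexextract.lil treebank.lildb derivbank.lildb lexicon.lildb templates.lildb lexbank.lildb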
The phrase structure trees stored in the input treebank are those obtained by transforming a corpus in the previous step. HPSG-style derivation trees are obtained by applying "devel/lexextract.lil" to the transformed phrase structure trees. The type of HPSG-style derivation trees is defined in the file "derivtypes.lil" of MAYZ. LEXICON and TEMPLATE in the above table refer to the lexicon and the lexical entry template database of Enju. For example, the grammar module defines the inverse schemas and related constraints as follows:
root_constraints($Sign) :-
    $Sign = (SYNSEM\(LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
                                     SPEC\[] & CONJ\[])))).

inverse_schema_binary(subj_head_schema, $Mother, $Left, $Right) :-
    $Left = (SYNSEM\($LeftSynsem &
                     LOCAL\CAT\(HEAD\MOD\[] &
                                VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
                                     SPEC\[] & CONJ\[])))),
    $Subj = $LeftSynsem,
    $Right = (SYNSEM\(LOCAL\CAT\(HEAD\$Head &
                                 VAL\(SUBJ\[$Subj] & COMPS\[] & SPR\[] &
                                      SPEC\$Spec & CONJ\[])))),
    $Mother = (SYNSEM\(LOCAL\CAT\(HEAD\$Head &
                                  VAL\(SUBJ\[] & COMPS\[] & SPR\[] &
                                       SPEC\$Spec & CONJ\[])))).

lexical_constraints(SURFACE\$Surface,
                    SYNSEM\LOCAL\CAT\HEAD\AUX\copula_be) :-
    auxiliary_be($Surface).
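In this excerpt, root_constraints/1 gives the constraint that the root sign of a derivation must satisfy, inverse_schema_binary/4 computes the daughter signs of the subject-head schema from a mother sign, and lexical_constraints/2 imposes word-specific constraints, here marking surface forms recognized by auxiliary_be/1 as the copula "be".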
Next, we obtain the lexicon and the template database from the feature structures of the leaf nodes of each derivation tree. The following interface predicates are applied to the leaf nodes of the derivation tree, in this order: lexical_entry_template/3, reduce_lexical_template/5, lexeme_name/4, and word_count_key/2. A schematic sketch of this sequence is given below.
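The following is a schematic sketch, not the actual MAYZ source, of how these predicates could be chained for one leaf node; the predicate name process_leaf and the argument order of lexeme_name/4 and word_count_key/2 are assumptions for illustration:

%% schematic sketch (assumed driver; not the actual MAYZ code)
process_leaf($WordInfo, $Sign) :-
    %% 1. make a lexical entry template from the terminal sign
    lexical_entry_template($WordInfo, $Sign, $Template),
    %% 2. reduce the template to a lexeme and the applied lexical rules
    reduce_lexical_template($WordInfo, $Template, $Key, $Lexeme, $Rules),
    %% 3. name the lexeme template (assumed argument order)
    lexeme_name($WordInfo, $Lexeme, $Rules, $TemplateName),
    %% 4. make the key used for counting word frequencies
    word_count_key($WordInfo, $CountKey).
    %% $TemplateName and $CountKey would then be recorded in the
    %% lexicon and the template database.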
These interface predicates are defined in "devel/lextemplate.lil". Let us look at some sample code from this file:
%% make a lexical entry template from the sign of a leaf node
lexical_entry_template($WordInfo, $Sign, $Template) :-
    $Sign = (SYNSEM\(LOCAL\CAT\$Cat &
                     NONLOCAL\$NonLocal)),
    copy($Cat, $OutCat),
    copy($NonLocal, $OutNonLocal),
    $Template = (hpsg_word &
                 SYNSEM\(LOCAL\CAT\($OutCat & VAL\SUBJ\[$Subj]) &
                         NONLOCAL\$OutNonLocal)),
    abstract_subj($Subj).
%% generalize the SYNSEM of the subject by removing word-specific features
abstract_subj($Synsem) :-
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, POSTHEAD\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, AGR\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, ADJ\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, AUX\]),
    restriction($Synsem, [LOCAL\, CAT\, HEAD\, TENSE\]).
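The two predicates above thus map a terminal sign to a template: lexical_entry_template/3 copies the CAT and NONLOCAL values of the sign into a fresh 'hpsg_word' feature structure, and abstract_subj/1 generalizes the subject slot by restricting away word-specific head features such as POSTHEAD, AGR, ADJ, AUX, and TENSE, so that words differing only in these subject features can share a template.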
%% reduce a lexical entry template to its lexeme; return the key of the
%% lexicon entry and the list of lexical rules that were applied
reduce_lexical_template($WordInfo, $LexEntry, $Key, $Lexeme, $Rules) :-
    get_sign($LexEntry, $Sign),
    get_lexeme($WordInfo, $LexEntry, $BaseWordInfo, $Lexeme1, [], $Rules1),
    canonical_copy($Lexeme1, $Lexeme),
    $Rules = $Rules1,
    $BaseWordInfo = (BASE\$Base & POS\$POS),
    $Key = (BASE\$Base & POS\$POS).
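Here canonical_copy/2 presumably normalizes the lexeme's feature structure, and the key of a lexicon entry is the pair of the BASE form and the POS of the base word, so that all inflected forms of a word are mapped to the same lexeme.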
%% repeatedly apply inverse lexical rules to strip derivational history
%% (the terminating clause that returns the accumulated lexeme is not
%% shown in this excerpt)
get_lexeme($WordInfo, $InTemplate, $NewWordInfo, $NewTemplate,
           $InRules, $NewRules) :-
    ($InRules = [$Rule1|_] ->
     upper_rule($Rule1, $Rule) ; true),
    inverse_rule_lex($Rule, $WordInfo, $InTemplate, $WordInfo1, $Template1),
    get_lexeme($WordInfo1, $Template1, $NewWordInfo, $NewTemplate,
               [$Rule|$InRules], $NewRules).
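In this excerpt, get_lexeme/6 accumulates the inverse lexical rules applied so far in its fifth argument: inverse_rule_lex/5 undoes one application of a lexical rule, while upper_rule/2, presumably defined elsewhere in the same file, constrains which rule may be undone next given the most recently undone rule.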
Let us now explain how lexrefine is used to refine the lexicon and the template database extracted by lexextract.
lexrefine RULE_MODULE ORIGINAL_LEXICON ORIGINAL_TEMPLATE NEW_LEXICON NEW_TEMPLATE

RULE_MODULE       | a LiLFeS program in which the lexical rules are defined
ORIGINAL_LEXICON  | input lexicon
ORIGINAL_TEMPLATE | input template database
NEW_LEXICON       | refined lexicon
NEW_TEMPLATE      | refined template database
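For example, continuing from the lexextract invocation above (the rule module name "devel/lexrule.lil" and the remaining file names are hypothetical):

  lexrefine devel/lexrule.lil lexicon.lildb templates.lildb refined-lexicon.lildb refined-templates.lildb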
First, the input template database is refined by applying the lexical rules defined in the rule module. Next, the input lexicon is refined: entries corresponding to templates that disappeared during the refinement are deleted from the lexicon, while entries corresponding to templates newly obtained by applying lexical rules are added. Finally, entries for unknown words are added to the lexicon. In Enju, the expansion of templates and of the lexicon is specified by the following two predicates:
expand_lexical_template($InTempName, $InTemplate, $Count, $LexRules,
                        $NewTemplate) :-
    get_variable('*template_expand_threshold*', $Thres),
    ($Count > $Thres ->
     %% frequent templates: expand by applying the ordered lexical rules
     ordered_lexical_rules($LexRules),
     apply_ordered_lexical_rules(_, $InTemplate, $LexRules, _, $Template1),
     get_sign($Template1, $NewTemplate) ;
     %% infrequent templates: keep the original sign without expansion
     $LexRules = [],
     get_sign($InTemplate, $NewTemplate)).
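In other words, lexical rules are applied only to templates whose frequency count exceeds the global threshold *template_expand_threshold*; infrequent templates are carried over unexpanded.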
The expansion of the lexicon then computes, for each original key, the new keys licensed by the applied lexical rules:

expand_lexicon($InKey, $NewTempName, $NewKey) :-
    $InKey = BASE\$Base & POS\$BasePOS,
    $NewTempName = LEXICAL_RULES\$Rules,
    %% derive the possible POSs of the inflected forms from the rules
    rules_to_inflection($BasePOS, $Rules, $POSList),
    member($NewPOS, $POSList),
    %% a new key pairs the base form with each derived POS
    $NewKey = BASE\$Base & POS\$NewPOS.
Finally, lexrefine adds entries for unknown words to the lexicon. Any word whose frequency is lower than a threshold is regarded as an unseen word, and new entries are created for it by using the interface predicate unknown_word_key/2. Details of the processing are as follows.
In Enju, the key of an unseen word is a feature structure of type 'word' in which only the POS feature is specified. Calling the interface predicate unknown_word_key/2 returns, as the key of the unseen word, a feature structure of type 'word' with the same POS value as the original key:
unknown_word_key(POS\$POS, POS\$POS).
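For instance (the base form "foobar" and the POS "NN" here are hypothetical), the behavior of this definition can be illustrated as follows:

%% ?- unknown_word_key(BASE\"foobar" & POS\"NN", $Key).
%% $Key is bound to POS\"NN": everything but the POS value is
%% discarded, so all unseen words with the same POS share one key.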