Unigram model

Here we explain the unigram model used for disambiguation during parsing.

The unigram model assigns lexical entries to words. Let w be a word, s the string containing the word, and l a lexical entry; the model estimates p(l|w,s). The probabilistic model of Enju is a maximum entropy model, so training consists of finding the best weights for its features. (For details, please refer to Probabilistic Model.) This is done with the maximum entropy estimator Amis and the following tools that work with Amis: unimaker and amisfilter.
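
As a rough illustration of the model form (a Python sketch, not Enju's or Amis's implementation), a maximum entropy model scores each candidate lexical entry by the summed weights of its active features and normalizes over the candidates for the word. The feature names and weights below are invented for the example.

```python
import math

def lexical_entry_probabilities(candidates, weights):
    """Return p(l | w, s) for each candidate lexical entry l.

    candidates: dict mapping a lexical entry name to the list of feature
                strings active for that entry in this context (hypothetical).
    weights:    dict mapping a feature string to its weight.
    """
    # Unnormalized score: exp of the summed weights of the active features.
    scores = {
        entry: math.exp(sum(weights.get(f, 0.0) for f in feats))
        for entry, feats in candidates.items()
    }
    z = sum(scores.values())  # normalization over the candidate entries
    return {entry: score / z for entry, score in scores.items()}

# Invented toy data: two candidate entries for one word.
candidates = {
    "[D< N.3sg>]_lxm-noun_adjective_rule": ["f_a", "f_b"],
    "[D< N.3sg>]_lxm": ["f_a"],
}
weights = {"f_a": 0.5, "f_b": 1.0}
probs = lexical_entry_probabilities(candidates, weights)
```

The entry with the larger summed weight receives the larger probability, and the probabilities over the candidates sum to one.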

The procedure is as follows:

  1. Output probabilistic events to a file with unimaker
  2. Apply masks to the probabilistic events with amisfilter and extract the relevant features to create data files in Amis format
  3. Calculate the best weights for the features with Amis

A probabilistic event is a string whose fields are separated by //, like the following:

ms-period-//NNP//[D< N.3sg>]_lxm-noun_adjective_rule//haag//NNP//haag//NNP//uni
The last field indicates the category of the event. The other fields are symbols that represent features of the event. For each word, unimaker outputs the probabilistic events corresponding to the assignment of a lexical entry to that word.
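
The field layout can be checked with a few lines of Python. This is only a sketch of the format described above; the function name split_event is ours and is not part of unimaker or amisfilter.

```python
def split_event(event):
    """Split a probabilistic event into (feature_fields, category)."""
    fields = event.split("//")
    # The last field is the category; everything before it is a feature field.
    return fields[:-1], fields[-1]

event = "ms-period-//NNP//[D< N.3sg>]_lxm-noun_adjective_rule//haag//NNP//haag//NNP//uni"
fields, category = split_event(event)
```

For the event above this yields seven feature fields and the category "uni".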

amisfilter applies masks to a probabilistic event according to the category of the event. A mask determines which of the fields are used together to create a certain feature: fields marked 1 are kept and fields marked 0 are replaced by _. For the above probabilistic event, applying the mask (0,1,1,1,1,1,1) would create the following feature:

_//NNP//[D< N.3sg>]_lxm-noun_adjective_rule//haag//NNP//haag//NNP//uni
amisfilter extracts features from probabilistic events in this way and generates data files in Amis format. Finally, Amis calculates the best weight for each of these features.
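
The masking step can be sketched in Python as follows. This mirrors the example above rather than amisfilter's actual code; the function name apply_mask and the length check are our assumptions.

```python
def apply_mask(event, mask, masked="_"):
    """Apply a 0/1 mask to the feature fields of a probabilistic event."""
    fields = event.split("//")
    feature_fields, category = fields[:-1], fields[-1]
    if len(mask) != len(feature_fields):
        raise ValueError("mask length must match the number of feature fields")
    # Keep a field where the mask is 1; replace it by "_" where the mask is 0.
    kept = [f if m else masked for f, m in zip(feature_fields, mask)]
    # The category field is never masked.
    return "//".join(kept + [category])

event = "ms-period-//NNP//[D< N.3sg>]_lxm-noun_adjective_rule//haag//NNP//haag//NNP//uni"
feature = apply_mask(event, (0, 1, 1, 1, 1, 1, 1))
```

With the mask (0,1,1,1,1,1,1), only the first field is replaced, which gives the feature shown above.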

Below we explain each of the steps involved.

Outputting Probabilistic Events to Files

Now let us explain how we use unimaker to output probabilistic events to files.

unimaker <Name of Model> <Event Extraction Module> <Derivbank> <Event File>

Name of Model            The name of the probabilistic model (also used for parsing)
Event Extraction Module  The LiLFeS module that includes the predicate for event extraction
Derivbank                The derivbank that resulted from grammar acquisition (lildb format)
Event File               The file to which probabilistic events are output (text format, or compressed in gz/bz format)

The TERM_TEMPLATE feature of a terminal node of a derivation tree holds the name of a lexical entry template (a combination of the name of a lexeme template and the history of lexical rule applications). unimaker generates probabilistic events from the word information of each terminal node (the 'word'-typed value assigned to the TERM_WORD feature), the name of its lexical entry template (the value of the TERM_TEMPLATE feature), and the word information of nearby terminal nodes. The processing proceeds as follows:

  1. Create a derivation word lattice from the derivation by calling the um_derivation_to_deriv_word_lattice/2 predicate. The WORD feature of each element of the derivation word lattice holds a terminal node of the derivation tree as its value.
  2. Create a word lattice from the derivation tree by calling the um_derivation_to_word_lattice/2 predicate and insert it into the chart.
  3. Output the probabilistic events of positive examples and negative examples from the features of the derivation word lattice. The values of the WORD feature of the elements of the derivation word lattice, that is, the terminal nodes of the derivation tree, are extracted and processed as follows:
    1. Supply the um_correct_lexical_entry/2 predicate with a terminal node of the derivation tree to obtain the correct lexical entry.
    2. Supply the um_complement_lexical_entry/2 predicate with a terminal node of the derivation tree to obtain, one by one, the lexical entries other than the correct one.
    3. Pass each lexical entry obtained to the extract_lexical_event/4 predicate, which extracts the probabilistic event (its category and features) and outputs it. The probabilistic events obtained from the correct lexical entry are output as positive examples, and those obtained from the other lexical entries are output as negative examples.
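
The labelling of positive and negative examples in step 3 can be sketched in Python. This is an illustration of the idea only: the function label_candidates is hypothetical, and in Enju the candidates come from the lexicon lookup performed by the LiLFeS predicates.

```python
def label_candidates(candidate_templates, correct_template):
    """Return (label, template) pairs: 1 for the correct entry, 0 for the rest."""
    if correct_template not in candidate_templates:
        raise ValueError("the correct template is not among the candidates")
    # The correct lexical entry gives the positive example ...
    labelled = [(1, correct_template)]
    # ... and every other candidate gives a negative example.
    labelled += [(0, t) for t in candidate_templates if t != correct_template]
    return labelled

# Template names taken from the sample event file for the word "Ms.".
candidates = [
    "[D< N.3sg>]_lxm",
    "[D< N.3sg>]_lxm-noun_adjective_rule",
    "[< NP.3sg.adj>]NP.adj_mod",
]
examples = label_candidates(candidates, "[D< N.3sg>]_lxm-noun_adjective_rule")
```

Each labelled template would then be turned into one line of the event file, prefixed by its label.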
When probabilistic events are output with the extract_lexical_event/4 predicate, information from the word lattice of nearby words can be included in the Feature field, because the word lattice has been inserted into the chart. The derivation word lattice of nearby words is not included in the Feature field. The predicates that create the two types of word lattice from the terminal nodes of a derivation tree are found in "devel/unimake.lil". "devel/unimake.lil" also includes the predicates for finding the correct lexical entry to be used at a terminal node of a derivation tree. These predicates are explained below.
  • A word lattice created by the um_derivation_to_word_lattice/2 predicate has a left_position feature and a right_position feature that give the position of the corresponding word. It also has a WORD feature, which is assigned the value of the TERM_WORD feature of a terminal node of the derivation tree.
  • A derivation word lattice created by um_derivation_to_deriv_word_lattice/2 likewise has features that give the position of the corresponding word. It also has a WORD feature, which is assigned a terminal node of the derivation tree.
  • The um_correct_lexical_entry/2 predicate returns the correct lexical entry from among the lexical entries assigned to a terminal node of a derivation tree. The returned lexical entry shares the same feature structure with the lexical entries supplied to the predicate.
  • The um_complement_lexical_entry/2 predicate returns, one by one, the lexical entries excluded by um_correct_lexical_entry/2 from the lexical entries assigned to a terminal node of a derivation tree.
These predicates are defined in the following way:
um_derivation_to_word_lattice(derivation_internal & DERIV_DTRS\$Dtrs,
                              $WordLattice) :-
    um_derivation_to_word_lattice_dtrs($Dtrs, $WordLattice). %% recursive predicate
um_derivation_to_word_lattice(derivation_terminal & TERM_WORD\$Word,
                              [left_position\$LPos & right_position\$RPos &
                               word\$LexEntry]) :-
    $LexEntry = [$Word],
    $Word = POSITION\$LPos,
    $RPos is $LPos + 1.

um_derivation_to_deriv_word_lattice(derivation_internal & DERIV_DTRS\$Dtrs,
                                    $WordLattice) :-
    um_derivation_to_deriv_word_lattice_dtrs($Dtrs, $WordLattice). %%  recursive predicate
um_derivation_to_deriv_word_lattice(derivation_terminal & $Term & TERM_WORD\$Word,
                                    [left_position\$LPos & right_position\$RPos &
                                     word\$LexEntry]) :-
    $LexEntry = $Term,
    $Word = POSITION\$LPos,
    $RPos is $LPos + 1.

um_correct_lexical_entry(TERM_WORD\$Word & LEXENTRY_SIGN\$Sign, $LexName) :-
    lookup_lexicon($Word, $TempNameList),
    member($TempName, $TempNameList),
    lookup_template($TempName, $LexEntry),
    equivalent($LexEntry, $Sign),
    !,
    $LexName = LEX_WORD\$Word & LEX_TEMPLATE\$TempName.

um_complement_lexical_entry(TERM_WORD\$Word & LEXENTRY_SIGN\$Sign, $LexName) :-
     lookup_lexicon($Word, $TempNameList1),
     check_coverage($TempNameList1, $Sign, $TempName1), %% check whether $TempNameList1 contains 
                                                        %% elements that carry $Sign
     findall($Lex,
             (member($TN, $TempNameList1),
              $TN \= $TempName1,
              $Lex = LEX_WORD\$Word & LEX_TEMPLATE\$TN),
             $LexList),
     member(LEX_TEMPLATE\$TempName, $LexList),
     $LexName = LEX_WORD\$Word & LEX_TEMPLATE\$TempName.
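
For readers less familiar with LiLFeS, the recursion in um_derivation_to_word_lattice/2 can be paraphrased in Python. This is a sketch under our own assumptions: the derivation tree is represented as nested dicts with hypothetical keys "daughters", "word", and "position", which stand in for the feature structures used above.

```python
def derivation_to_word_lattice(node):
    """Collect one lattice element per terminal node, from left to right."""
    if "daughters" in node:
        # Internal node: concatenate the lattices of the daughters.
        lattice = []
        for daughter in node["daughters"]:
            lattice.extend(derivation_to_word_lattice(daughter))
        return lattice
    # Terminal node: the word spans [position, position + 1).
    word = node["word"]
    left = word["position"]
    return [{"left_position": left, "right_position": left + 1, "word": [word]}]

# A toy two-word derivation tree (the structure is invented for illustration).
tree = {
    "daughters": [
        {"word": {"surface": "haag", "position": 0}},
        {"daughters": [{"word": {"surface": "plays", "position": 1}}]},
    ]
}
lattice = derivation_to_word_lattice(tree)
```

The derivation word lattice is built in the same way, except that each element would hold the terminal node itself instead of only its word information.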

The extract_lexical_event/4 predicate, which is used for extracting probabilistic events, is found in "grammar/unievent.lil". Probabilistic events of the category "uni" are extracted by this predicate. Their features contain the following fields:

  • the word two positions before the current word (string and POS, stemmed string and its POS)
  • the word that immediately precedes the current word (string and POS, stemmed string and its POS)
  • the current word (string and POS, stemmed string and its POS), its lexical entry, and the name of its lexeme
  • the word that immediately follows the current word (string and POS, stemmed string and its POS)
  • the word two positions after the current word (string and POS, stemmed string and its POS)
  • the word three positions after the current word (string and POS, stemmed string and its POS)
The predicate is defined as follows:
extract_lexical_event("hpsg-uni", "uni", $LexEntry, $Event) :-
    $LexEntry = (LEX_WORD\ (SURFACE\ $Surface &
                            POS\ $Pos &
                            BASE\ $Base &
                            BASE_POS\ $BasePOS &
                            POSITION\ $Position) &
                 LEX_TEMPLATE\($LexTemplate & LEXEME_NAME\$LexemeName)),
    lex_template_label($LexTemplate, $LexName),
    $PositionN2 is $Position - 2,
    $PositionN1 is $Position - 1,
    $PositionP1 is $Position + 1,
    $PositionP2 is $Position + 2,
    $PositionP3 is $Position + 3,
    $PositionP4 is $Position + 4,
    lexical_event($PositionN2, $PositionN1, $Event, $Event2),     %% -2
    lexical_event($PositionN1, $Position,   $Event2, $Event3),    %% -1
    $Event3 = [$Surface, $Pos, $LexName, $Base, $BasePOS, $LexemeName|$Event4],
    lexical_event($PositionP1, $PositionP2, $Event4, $Event5),    %%  1
    lexical_event($PositionP2, $PositionP3, $Event5, $Event6),    %%  2
    lexical_event($PositionP3, $PositionP4, $Event6, []).         %%  3
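
The field order produced by this predicate can be reproduced with a small Python sketch. It follows the definition above (two preceding words, the current word, three following words), but it is not the LiLFeS implementation: the dict keys are hypothetical, and the "EOS" padding for positions past the end of the sentence is our assumption (only "BOS" appears in the sample output).

```python
BOS = "BOS"  # sentence-start padding, as seen in the sample output
EOS = "EOS"  # assumed sentence-end padding; not confirmed by the source

def context_fields(words, position):
    """Return the four fields of a context word, or padding if out of range."""
    if position < 0:
        return [BOS] * 4
    if position >= len(words):
        return [EOS] * 4
    w = words[position]
    return [w["surface"], w["pos"], w["base"], w["base_pos"]]

def extract_uni_event(words, position, lex_name, lexeme_name):
    """Build a 'uni' event string for the word at the given position."""
    w = words[position]
    fields = []
    for offset in (-2, -1):  # two preceding words
        fields += context_fields(words, position + offset)
    # Current word, with its lexical entry name and lexeme name.
    fields += [w["surface"], w["pos"], lex_name, w["base"], w["base_pos"], lexeme_name]
    for offset in (1, 2, 3):  # three following words
        fields += context_fields(words, position + offset)
    return "//".join(fields + ["uni"])

def word(surface, pos, base, base_pos):
    return {"surface": surface, "pos": pos, "base": base, "base_pos": base_pos}

sentence = [
    word("ms-period-", "NNP", "ms-period-", "NNP"),
    word("haag", "NNP", "haag", "NNP"),
    word("plays", "VBZ", "play", "VB"),
    word("elianti", "NNP", "elianti", "NNP"),
]
event = extract_uni_event(
    sentence, 0, "[D< N.3sg>]_lxm-noun_adjective_rule", "[D< N.3sg>]_lxm"
)
```

For the first word of this sentence, the result has 8 padding fields, 6 fields for the current word, 12 fields for the following words, and the category.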

The event file output by unimaker looks like the following. (Each event is written on one line; long lines are wrapped here for display. In the case of event_2_0, there are 3 probabilistic events.)

event_2_0
1       BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//
[D< N.3sg>]_lxm-noun_adjective_rule//ms-period-//NNP//
[D< N.3sg>]_lxm//haag//NNP//haag//NNP//plays//VBZ//play//VB//
elianti//NNP//elianti//NNP//uni
0       BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//
[D< N.3sg>]_lxm//ms-period-//NNP//[D< N.3sg>]_lxm//haag//
NNP//haag//NNP//plays//VBZ//play//VB//elianti//NNP//elianti//NNP//
uni
0       BOS//BOS//BOS//BOS//BOS//BOS//BOS//BOS//ms-period-//NNP//
[< NP.3sg.adj>]NP.adj_mod//ms-period-//NNP//
[< NP.3sg.adj>]NP.adj_mod//haag//NNP//haag//NNP//plays//VBZ//
play//VB//elianti//NNP//elianti//NNP//uni

event_2_1
...
In this case, event_2_0 represents the probabilistic events in which the lexical entry denoted by "[D< N.3sg>]_lxm-noun_adjective_rule" is assigned to the word "Ms.". The "1" at the beginning of the 2nd line indicates that it is a positive example. The "0" at the beginning of the 3rd and 4th lines indicates that they are negative examples. "BOS" marks the beginning of a sentence.
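
A file in this layout can be read back with a short Python sketch. It assumes, as described above, that each event occupies one unwrapped line, that the label and the event are separated by whitespace, and that header lines start with "event_"; the function name and the shortened sample events are ours.

```python
def parse_event_file(lines):
    """Group labelled events by their 'event_...' header line."""
    groups = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines separate the groups
        if line.startswith("event_"):
            current = line
            groups[current] = []
        elif current is not None:
            # Split off the label only; the event itself may contain spaces.
            label, event = line.split(None, 1)
            groups[current].append((int(label), event))
    return groups

# Shortened events, for illustration only.
sample = [
    "event_2_0",
    "1\tms-period-//NNP//[D< N.3sg>]_lxm-noun_adjective_rule//haag//NNP//haag//NNP//uni",
    "0\tms-period-//NNP//[D< N.3sg>]_lxm//haag//NNP//haag//NNP//uni",
    "",
    "event_2_1",
    "1\thaag//NNP//[D< N.3sg>]_lxm//plays//VBZ//play//VB//uni",
]
groups = parse_event_file(sample)
```

Each group then holds its positive example (label 1) together with the negative examples (label 0) for the same word.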

Applying masks to extract features

Let us illustrate how to use amisfilter to apply masks to the probabilistic events output above and extract features for generating data files in Amis format.

amisfilter <Name of Model> <Mask Module> <Probabilistic Event File> <Count File> <Model File> <Event File>

Name of Model             The name of the probabilistic model (also used for parsing)
Mask Module               The LiLFeS module that applies masks to probabilistic events
Probabilistic Event File  The input probabilistic event file (text file or gz/bz file)
Count File                The file to which the frequencies of features are output (text file)
Model File                The model file (AmisModel format)
Event File                The event file (AmisEvent format)

The actual processing is as follows:

  1. Create features by applying masks to the probabilistic events in the probabilistic event file that represent observed events. The masks applied are those defined for the category of each event; category-specific masks are defined by the feature_mask/3 predicate.
  2. Count how often each feature appears with observed events and output the frequencies to the count file.
  3. Adopt the features whose frequencies are above a predefined threshold, and create the model file and the event file in Amis format.
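
The counting and thresholding steps can be sketched in Python. This is an illustration of the procedure, not amisfilter's code: the function names, the masks, and the toy events are invented, and whether the threshold comparison is strict is our assumption (the sketch keeps features whose count is at least the threshold).

```python
from collections import Counter

def apply_mask(event, mask):
    """Replace the feature fields whose mask value is 0 by '_'."""
    fields = event.split("//")
    feature_fields, category = fields[:-1], fields[-1]
    kept = [f if m else "_" for f, m in zip(feature_fields, mask)]
    return "//".join(kept + [category])

def count_features(observed_events, masks_by_category):
    """Count the features obtained by masking each observed event."""
    counts = Counter()
    for event in observed_events:
        category = event.rsplit("//", 1)[-1]
        # Only the masks defined for the event's category are applied.
        for mask in masks_by_category.get(category, []):
            counts[apply_mask(event, mask)] += 1
    return counts

def select_features(counts, threshold):
    """Keep the features seen at least `threshold` times (assumed comparison)."""
    return {feature for feature, count in counts.items() if count >= threshold}

# Toy observed events with two feature fields each, and two toy masks.
observed = ["haag//NNP//uni", "plays//VBZ//uni", "elianti//NNP//uni"]
masks = {"uni": [(1, 0), (0, 1)]}
counts = count_features(observed, masks)
selected = select_features(counts, threshold=2)
```

Here only the feature "_//NNP//uni" is seen twice, so it is the only one adopted with a threshold of 2.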

The feature_mask/3 predicate is found in "grammar/lexmask.lil". The mask    


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)