Parsing process example


In this section, we take the sentence "I saw a girl with a telescope." as an example to illustrate how Enju parses it.

Preprocessing

As described in the Preprocessing section, preprocessing is handled by the sentence_to_word_lattice/2 predicate.

The input sentence is first tagged by the external_tagger/2 predicate:

"I/PRP saw/VBD a/DT girl/NN with/IN a/DT telescope/NN ./."
The tagged sentence is then processed into a list of word/POS pairs and stemmed. Next, a feature structure of the 'extent_word' type is created for each word by the token_to_word_lattice/2 predicate. Below is the representation of "saw" as a feature structure of the 'extent_word' type:
extent_word
LEFT_POS  1
RIGHT_POS 2
WORD < word
       INPUT     "saw"
       SURFACE   "saw"
       BASE      "see"
       INPUT_POS [1] "VBD"
       POS       [1]
       BASE_POS  "VB"
       POSITION  1 >
A list of such feature structures is the output of the sentence_to_word_lattice/2 predicate.
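The preprocessing step can be sketched in Python as follows. This is an illustrative reimplementation, not Enju's actual code; the record keys mirror the feature names above, and stemming (the BASE feature) is omitted since it requires a lemmatizer.

```python
def tagged_sentence_to_words(tagged):
    """Turn a tagged sentence like "I/PRP saw/VBD ..." into word records
    with chart positions, mimicking sentence_to_word_lattice/2."""
    words = []
    for position, token in enumerate(tagged.split()):
        # rpartition handles tokens such as "./." correctly
        surface, _, pos = token.rpartition("/")
        words.append({
            "INPUT": surface,
            "SURFACE": surface,
            "INPUT_POS": pos,
            "POSITION": position,
            # LEFT_POS/RIGHT_POS delimit the word's span in the chart
            "LEFT_POS": position,
            "RIGHT_POS": position + 1,
        })
    return words

words = tagged_sentence_to_words(
    "I/PRP saw/VBD a/DT girl/NN with/IN a/DT telescope/NN ./.")
```

Here `words[1]` corresponds to the 'extent_word' structure for "saw" shown above, spanning positions 1 to 2.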

Creating Lexical Entries

We use the lexical_entry/2 and lexical_entry_sign/2 predicates to pass the signs of lexical entries to UP.

The lexical_entry/2 predicate is called with the value of the WORD feature of a feature structure of the 'extent_word' type. The value of this feature is a list of feature structures of the 'word' type. The lookup_lexicon/2 predicate looks up the template for each element of the list. Its subclauses lexicon_lookup_key/2 and unknown_word_lookup_key/2 compute the keys used to retrieve lexical entries from the dictionary. For the feature structure corresponding to "saw", the value of the WORD feature is a one-element list. The key created for this element by lexicon_lookup_key/2 is made up of the base form and the POS of the input word. The key for "saw" is as follows:

word
BASE "see"
POS "VBD"
Looking up the dictionary with the above key returns a list of feature structures of the 'lex_template' type. The resulting list is:
lex_template
LEXEME_NAME "[npVPnp]xm_10"
LEXICAL_RULES <past_verb_rule>
Supplying the above feature structure of the 'lex_template' type to the lexical_entry/2 predicate returns the following feature structure:
lex_entry
LEX_WORD word
         INPUT     "saw"
         SURFACE   "saw"
         BASE      "see"
         INPUT_POS [1] "VBD"
         POS       [1]
         BASE_POS  "VB"
         POSITION  1
LEX_TEMPLATE lex_template
             LEXEME_NAME   "[npVPnp]xm_10"
             LEXICAL_RULES <past_verb_rule>
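The key construction and dictionary lookup described above can be sketched as follows. The dictionary contents and record keys are illustrative stand-ins for lexicon_lookup_key/2 and lookup_lexicon/2, not Enju's actual data structures.

```python
# A toy lexicon keyed by (base form, POS), as computed by
# lexicon_lookup_key/2; values are lists of lexical-entry templates.
LEXICON = {
    ("see", "VBD"): [
        {"LEXEME_NAME": "[npVPnp]xm_10",
         "LEXICAL_RULES": ["past_verb_rule"]},
    ],
}

def lexicon_lookup_key(word):
    """Build the lookup key from the base form and POS of the word."""
    return (word["BASE"], word["POS"])

def lookup_lexicon(word):
    """Return the list of templates for the word (empty if unknown)."""
    return LEXICON.get(lexicon_lookup_key(word), [])

templates = lookup_lexicon({"BASE": "see", "POS": "VBD"})
```

Combining a retrieved template with the original 'word' structure yields the 'lex_entry' structure shown above.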
After the lexical_entry/2 predicate has been called for each feature structure of the 'extent_word' type produced by preprocessing, a FOM (figure of merit) score is calculated for each lexical entry with the fom_lexical_entry/3 predicate. Supplying fom_lexical_entry/3 with a feature structure of the 'word' type and a feature structure of the 'lex_entry' type returns a FOM score. Lexical entries assigned a low FOM score are pruned.
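FOM-based pruning can be sketched as a simple beam filter: entries whose score falls too far below the best one are dropped. The scoring function and beam width here are invented for illustration; they stand in for fom_lexical_entry/3 and Enju's actual pruning parameters.

```python
def prune_by_fom(entries, fom, beam):
    """Keep only entries whose FOM score is within `beam` of the best."""
    scored = [(fom(e), e) for e in entries]
    best = max(score for score, _ in scored)
    return [e for score, e in scored if score >= best - beam]

# Hypothetical lexical-entry templates with log-probability-style scores.
entries = ["[npVPnp]xm_10", "[npVP]xm_7", "[npVPpp]xm_3"]
scores = {"[npVPnp]xm_10": -1.0, "[npVP]xm_7": -4.5, "[npVPpp]xm_3": -2.0}
kept = prune_by_fom(entries, scores.get, beam=2.0)
# → ["[npVPnp]xm_10", "[npVPpp]xm_3"]
```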

Next, the lexical_entry_sign/2 predicate is called for each remaining feature structure of the 'lex_entry' type. Within lexical_entry_sign/2, the lookup_template/2 predicate is defined for looking up the dictionary with the value of the LEX_TEMPLATE feature of a feature structure of the 'lex_entry' type. For the feature structure of the 'lex_entry' type given above, we obtain the corresponding template. Information that varies across lexical entries is then added to the template. In addition, the value of the LEX_WORD\INPUT feature is unified with the value of the PHON feature. The result of these operations is the output of the lexical_entry_sign/2 predicate.

The output is a feature structure of the 'sign' type. It is supplied to the reduce_sign/3 predicate for factoring. Factoring is the operation of removing information from a feature structure of the 'sign' type so that equivalent signs can be contracted into a single sign; the contracted sign is used in all subsequent parsing. Supplying a sign to reduce_sign/3 returns a contracted sign together with the information removed from it. (For more details, see the Link section.) After that, supplying the fom_terminal/4 predicate with a feature structure of the 'lex_entry' type, the corresponding contracted sign, and the information pruned during contraction returns a FOM score. At this point, signs with low FOM scores are pruned as well.
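The factoring operation can be sketched as follows. The sign representation (a dict with a CAT feature and extra features) and the equivalence criterion (same CAT) are invented for illustration; reduce_sign/3 operates on full HPSG signs.

```python
def reduce_sign(sign):
    """Split a sign into a reduced core and the removed remainder,
    in the spirit of reduce_sign/3."""
    reduced = {"CAT": sign["CAT"]}
    removed = {k: v for k, v in sign.items() if k != "CAT"}
    return reduced, removed

def factor(signs):
    """Contract equivalent signs into one, keeping the removed
    information aside so it can be recovered after parsing."""
    table = {}
    for sign in signs:
        reduced, removed = reduce_sign(sign)
        # setdefault keeps the first reduced sign; later equivalent
        # signs only contribute their removed information
        table.setdefault(reduced["CAT"], (reduced, []))[1].append(removed)
    return table

signs = [
    {"CAT": "VP", "SEM": "attach-to-VP"},
    {"CAT": "VP", "SEM": "attach-to-NP"},
]
factored = factor(signs)  # one contracted sign, two removed records
```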

Chart Parsing

UP makes use of data structures generally referred to as edges, charts and links for parsing.

Edges

An edge consists of a left position, a right position, a sign, a list of links, and a FOM score. The left and right positions mark the extent of a parsed fragment; in short, an edge is the representation of the sign corresponding to a fragment starting at the left position and ending at the right position. The remaining information is a list of links, used for recovering pruned information, and a FOM score, used for disambiguation.
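The edge structure just described can be sketched as a small dataclass; the field names are illustrative, not Enju's internal ones.

```python
from dataclasses import dataclass, field

@dataclass
class Edge:
    left: int          # left position of the covered fragment
    right: int         # right position of the covered fragment
    sign: object       # the (factored) sign for the fragment
    links: list = field(default_factory=list)  # info removed by factoring
    fom: float = 0.0   # figure-of-merit score for disambiguation

# An edge for "saw", spanning positions 1 to 2 of the example sentence.
edge = Edge(left=1, right=2, sign="VP-sign", fom=-1.0)
```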

Chart

A chart is a data structure used for preserving the state of a parse. We use a 2-D table to represent such a chart. Each element in the chart is referred to as a CKY cell. For the sentence "I saw a girl with a telescope", the chart constructed at the point where parsing starts is as follows:

[Figure: the chart]
Each CKY cell is written as S[i][j], where i and j are positions in the input sentence. S[0][1] stores the IDs of the edges corresponding to "I", S[1][2] stores the IDs of the edges corresponding to "saw", and so on up to S[6][7], which stores the IDs of the edges corresponding to "telescope". As parsing proceeds, the IDs of edges corresponding to larger fragments are stored in CKY cells as well. For example, S[0][2] stores the IDs of edges corresponding to "I saw", S[1][3] stores the IDs of edges corresponding to "saw a", and S[0][3] stores the IDs of edges corresponding to the even larger fragment "I saw a". If the input sentence is of length n, the CKY cell that stores the IDs of the edges corresponding to the whole sentence is S[0][n]. For the example sentence, this cell is S[0][7].
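The initial chart can be sketched as a 2-D table of cells, with lexical edges placed on the diagonal. The edge-ID strings are placeholders for illustration.

```python
def make_chart(n):
    """Build an (n+1) x (n+1) table of empty CKY cells."""
    return [[[] for _ in range(n + 1)] for _ in range(n + 1)]

sentence = "I saw a girl with a telescope".split()
chart = make_chart(len(sentence))

# Lexical edges for each word go into the cell S[i][i+1].
for i, word in enumerate(sentence):
    chart[i][i + 1].append(f"edge:{word}")
```

At this point chart[1][2] holds the edge for "saw", and chart[0][7], the cell for the whole sentence, is still empty.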

Link

As parsing proceeds, the syntactic rules defined by id_schema_unary/4 and id_schema_binary/5 are applied to child nodes to form mother nodes. For example, the result of applying syntactic rules to edges stored in S[1][2] and S[2][4] is stored in S[1][4]. At this point, equivalent signs stored in the same CKY cell are contracted into one sign by the factoring operation defined by the reduce_sign/3 predicate. Factoring removes information that can be recovered after parsing finishes; to allow such recovery, data structures referred to as links are created. The information removed by factoring, such as the signs of child nodes and semantic representations, is stored in links. Since an edge resulting from contraction is formed from more than one edge, one such edge can correspond to multiple links. After parsing finishes, the whole parse tree can be traced through the links between edges.

For the example sentence "I saw a girl with a telescope.", among the edges stored in S[1][2] is an edge containing a sign like this, and among the edges stored in S[2][4] is another edge containing a sign like this. The 'head_comp_schema' specified by the id_schema_binary/5 predicate is then applied to these two signs. The arguments of the id_schema_binary/5 predicate are the name of the schema to be applied, the feature structure of the left daughter, the feature structure of the right daughter, the feature structure of the mother node, and the LiLFeS program to be executed after applying the schema. (The id_schema_unary/4 predicate takes only one feature structure for the daughter node.) The feature structures of the left daughter and the right daughter of the 'head_comp_schema' are roughly as follows:

[F]
SYNSEM|LOCAL|CAT|VAL  SUBJ:  [A]
                      COMPS: < [B] | [C] >
                      SPR:   [D]
                      SPEC:  [E]
                      CONJ:  < >

[G]
SYNSEM: [B]
The two feature structures given above are unified with the signs of the daughter edges. The sign of the mother node then becomes as follows (the Valence Principle is hardcoded in the schema):
SYNSEM|LOCAL|CAT|VAL  SUBJ:  [A]
                      COMPS: [C]
                      SPR:   [D]
                      SPEC:  [E]
                      CONJ:  < >
DTR  HEAD_DTR:    [F]
     NONHEAD_DTR: [G]
Next, depending on the LiLFeS program specified to be executed after applying the schema, the signs of the mother node and the daughter nodes are unified with feature structures that satisfy the other principles. The resulting sign of the mother node is processed by reduce_sign/3 for factoring. What reduce_sign/3 does is return a contracted sign and a feature structure to be stored in a link; the original sign can be recovered by unifying the contracted sign with the feature structure stored in the link. As a result of factoring, we obtain a contracted sign from a complete sign. We then pass to the fom_binary/6 predicate the name of the relevant schema (in this case, 'head_comp_schema'), the sign of the left daughter, the sign of the right daughter, the sign of the mother node after factoring, and the feature structure stored in the link during factoring. The output of the predicate is the FOM score for applying the schema. (If the schema being applied is one specified by the id_schema_unary/4 predicate, the fom_unary/5 predicate is used instead of fom_binary/6.) In this way, the edge of the mother node is computed and stored in S[1][4].
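The overall chart-filling loop can be sketched with a toy CFG in place of the HPSG schemas: for each span, every split point is tried, and a binary rule combines the categories of the two sub-spans, just as id_schema_binary/5 combines daughter signs. Factoring and FOM scoring are omitted here; sets play the role of contracting equivalent results.

```python
def parse(words, lexicon, rules):
    """Fill a CKY chart bottom-up. `rules` maps (left, right) category
    pairs to the set of mother categories (a stand-in for the schemas)."""
    n = len(words)
    chart = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):
        chart[i][i + 1] = set(lexicon[w])
    for width in range(2, n + 1):          # from small spans to large
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):      # every split point
                for left in chart[i][k]:
                    for right in chart[k][j]:
                        chart[i][j] |= rules.get((left, right), set())
    return chart

# A toy grammar for the fragment "I saw a girl".
lexicon = {"I": ["NP"], "saw": ["V"], "a": ["DT"], "girl": ["N"]}
rules = {("DT", "N"): {"NP"}, ("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
chart = parse(["I", "saw", "a", "girl"], lexicon, rules)
# "S" ends up in chart[0][4], the cell covering the whole fragment
```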

Parsing proceeds in this way until we reach the CKY cell corresponding to the whole sentence (in this case, S[0][7]). A FOM score is calculated for each edge in this cell by the fom_root/2 predicate: supplying the sign of a root node to fom_root/2 returns the FOM score of that sign. Whether these edges satisfy the root condition is determined by the root_sign/1 predicate. For the example sentence "I saw a girl with a telescope.", there is a sign in S[0][7] that gets the highest FOM among all edges satisfying the root condition. This sign is the result of contracting two signs into one: one corresponds to the analysis that attaches "with a telescope" to the VP, and the other to the analysis that attaches "with a telescope" to the NP. By tracing through the edges and links, following the path with the best FOM score, we can recover the sign with the best FOM score.
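Root selection can be sketched as follows, in the spirit of root_sign/1 and fom_root/2; the edge representation (category, score) and the root condition are invented for illustration.

```python
def best_root(edges, is_root, fom):
    """Among the edges of the top cell, keep those satisfying the root
    condition and return the one with the highest FOM score."""
    candidates = [e for e in edges if is_root(e)]
    return max(candidates, key=fom) if candidates else None

# Hypothetical edges from the top cell: (category, FOM score).
edges = [("S", -2.0), ("NP", -1.0), ("S", -3.5)]
best = best_root(edges,
                 is_root=lambda e: e[0] == "S",   # root condition
                 fom=lambda e: e[1])              # FOM of the sign
# → ("S", -2.0): the NP edge fails the root condition despite its score
```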

Output

The sign with the highest FOM score obtained from the above parse is output. For the above example, the output from the "enju/outputdeps.lil" module is as follows:

ROOT  ROOT  ROOT  ROOT  -1  ROOT  saw       see       VBD  VB   1
saw   see   VBD   VB     1  ARG1  I         i         PRP  PRP  0
saw   see   VBD   VB     1  ARG2  girl      girl      NN   NN   3
a     a     DT    DT     2  ARG1  girl      girl      NN   NN   3
a     a     DT    DT     5  ARG1  telescope telescope NN   NN   6
with  with  IN    IN     4  ARG1  saw       see       VBD  VB   1
with  with  IN    IN     4  ARG2  telescope telescope NN   NN   6


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)