Structure of the Enju system

Structure of Sign
System architecture

Structure of Sign

The grammar of Enju is based on the theory of Head-driven Phrase Structure Grammar (HPSG). In HPSG, constraints on the structure of a language are represented with typed feature structures. LiLFeS Home Page presents a brief introduction to typed feature structures.

One of the characteristics of HPSG is that most of the constraints on syntax and semantics are represented in lexical entries, while only a small number of grammar rules (corresponding to CFG rules) are defined and they represent general constraints irrelevant to specific words. This is because the constraints on the structure of a sentence are mostly introduced by words.

Syntactic/semantic constraints of words/phrases are represented in the data structure called sign. In the current implementation of Enju, the structure of the sign basically follows [Pollar and Sag, 1994] and LinGO English Resource Grammar (ERG), while the type hierarchy is much simplified and modified not to use complex constraints nor Minimal Recursion Semantics (MRS).

PHON: A sequence of words governed by the phrase

SYNSEM

LOCAL

CAT

HEAD		Constraints inherited from the head daughter
		MOD: Constraints of a modifying phrase
		POSTHEAD: Whether this phrase modifies a preceding phrase

VAL		Subcategorization frame
		SUBJ: Constraints of left phrases that will be subcategorized
		COMPS: Constraints of right phrases that will be subcategorized
		SPR: Constraints of specifiers that will be subcategorized
		SPEC: Constraints of a specifying phrase
		CONJ: Constraints of conjuncts

CONT: Predicate-argument structure

CONX: Currently not used

NONLOCAL

Constraints of long-distance dependency

INHER		Constraints inherited from the daughters
		QUE: Constraints of wh-question words
		REL: Constraints of an antecedent phrase of a relative clause
		SLASH: Constraints of a phrase in long-distance dependencies
		F_REL: Constraints of free relatives

TO_BIND		Constraints bound in this phrase
		QUE: Constraints of wh-question words
		REL: Constraints of an antecedent phrase of a relative clause
		SLASH: Constraints of a phrase in long-distance dependencies
		F_REL: Constraints of free relatives

Constraints of phrases include various syntactic features (part-of-speech, agreement, tense, etc.).

CONT feature has a predicate-argument structure of the phrase. Predicate-argument structures represent relations of logical subject/object and modifying relations. The CONT feature of the sign of the top node shows the predicate-argument structure of the whole sentence.

Types and features used in the Enju grammar are all defined in "enju/types.lil". For details, see the source file.

System architecture

The Enju system uses UP (included in the MAYZ package), a general-purpose parser for unification grammars. UP parses a sentence with provided lexical entries and grammar rules. Enju creates the data passed to UP in the following way.

Architecture of
the grammar

An input sentence is passed to sentence_to_word_lattice/2, and converted to a word lattice, i.e., a list of extents (a pair of word position and word information). A word lattice is passed to UP, and parsing starts. sentence_to_word_lattice/2 first applies a POS tagger to an input sentence (external_tagger/2), splits it into words, and applies stemming to the words. sentence_to_word_lattice/2 is implemented in "enju/grammar.lil". By default, a POS tagger is "uptagger", and it can be changed by "-t" option of "enju" (or by initialize_external_tagger/2).

The predicate (lexical_entry/2 and lexical_entry_sign/2) makes lexical entries by using two databases. One is a mapping from a word/POS pair into a list of the names of lexical entry templates assigned to the word (lookup_lexicon/2). The other is a mapping from the name of a template into a feature structure of the template (lookup_template/2). Lexical entries are constructed by adding word-specific information (e.g. PHON feature) to lexical entry templates. This predicate is implemented in "enju/grammar.lil".

Grammar rules (i.e., schemas) are implemented in "enju/schema.lil". They define phrase structure rules to make a mother from its daughters.

Inside UP, parsing proceeds in the following steps.

preprocessing
lexicon lookup
kernel parsing

The preprocessor converts an input sentence into a word lattice by sentence_to_word_lattice/2. In the lexicon lookup, the parser uses the predicate (lexical_entry/2 and lexical_entry_sign/2) to find lexical entries for the word lattice. Finally, the parser does phrase analysis using the defined grammar rules in the kernel parsing process. UP uses data structures called edge and chart for phrase analysis.

Edge

Edge is a quadruple < left-position, right-position, sign, FOM (Figure of Merit)>, which represents a phrasal structure for the sub-string of the target sentence. Left-position and right-position are positions in the sentence. Sign is a phrasal structure represented as a feature structure. That is, an edge is a phrase structures that corresponds to a region. The edge also has an information about the score (FOM) for disambiguation of analysis.

Chart

The parser has a data structure generally called `chart.' The chart is a two dimensional table. Here, we call each cell in the table `CKY cell.' Let an input sentence s(= w1, w2, w3,...,wn), for example, w1 = "I", w2="saw", w3= "a", w4 = "girl", w5 = "with", w6 = "a", w7 = "telescope" for the sentence "I saw a girl with a telescope", the chart is arranged as follows.

Let S[i][j] be each cell in the chart. The suffixes i and j correspond to the left position and the right position in the sentence. For example, `edges' correspond to "I" are stored to S[0][1], edges for "saw" are stored to S[1][2], and edges for "telescope" are stored to S[6][7]. That is, the edges for word wi are stored to S[i-1][i]. The phrasal edges generated during parsing are stored to the chart. For example, edges for "I saw" are stored to S[0][2], edges for "saw a" are stored to S[1][3], and edges for wider phrase "I saw a" are stored to S[0][3]. Supposing that the sentence length is n, edges for the whole sentence are stored to S[0][n]. In the exmaple above, the edges for the whole sentence are store to S[0][7]. During parsing, an edge (mother) is generated by applying the grammar rules to two edges (daughters). For example, the results of applying the grammar rules to an edge in S[0][3] and an edge in S[3][5] are stored to S[0][5].

FOM

FOM of an edge represents how likely the edge is. When a mother edge is generated by applying a grammar rule, the FOM of the mother can be obtained by adding the score of the rule. Edges for the whole sentence region, which correspond to roots of parse trees, must satisfy the root condition. Of all the trees satisfying the condition, the one with the best FOM is taken as the final result.

Enju Manual Enju Home Page Tsujii Laboratory

MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)