Enju 2.4 Output Specifications

Yusuke Miyao (National Institute of Informatics, Japan)

Last updated: 18/Jun/2010

Overview
Phrase structures
Predicate argument structures
HPSG-specific features
Appendix: Correspondences between Enju and PTB

1. Overview

Enju outputs phrase structures and predicate argument structures in an XML format. Phrase structures are tree structures that express how words combine to form phrases and clauses. Predicate argument structures are sets of relations that describe the semantic relations among the words, phrases, and clauses in a sentence.

For example, a phrase structure of the sentence "John has come." is described by the following tree structure:

(S (NP John) (VP has (VP come)))

where "S", "NP", and "VP" are syntactic categories of phrases. In phrase structures, internal nodes are syntactic categories, and terminal nodes are words.

A predicate argument structure of the same sentence is given as the following three relations.

< come, arg1, John >
< has, arg1, John >
< has, arg2, come >

Each tuple represents a labeled relation (a labeled dependency from a word to a phrase or clause). The first element of a tuple is the predicate word, the second is the label of the relation, and the third is the argument phrase or clause of the predicate. For example, the first tuple means that "arg1" (the first argument, i.e., the semantic subject) of "come" is "John". Because these relations hold between a predicate and its arguments in the sense of predicate logic, they do not necessarily correspond to syntactic head/argument relations. For example, when a prepositional phrase modifies a noun, the preposition is the predicate and the noun is the "arg1" of the preposition, whereas in syntax the noun is the head and the prepositional phrase is the non-head.

Enju represents both structures in XML with three tags as listed below.

tag	attributes	description
sentence	id parse_status fom	sentence
cons	id cat xcat schema head sem_head	syntactic constituents (phrases, clauses, etc.)
tok	id cat pos base tense aspect voice aux type lexentry pred arg1 arg2 arg3 arg4 mod	words, punctuation marks

In general, "cons" represents phrase structures. A unique identifier is assigned to every "cons" tag and is given by the "id" attribute. Syntactic categories of phrases and clauses, such as "NP" and "S", are expressed by the "cat" attribute, and finer classifications of categories are represented by the "xcat" attribute. The syntactic and semantic head daughters of internal nodes are expressed by the "head" and "sem_head" attributes, respectively; each holds the identifier of the head daughter, which is a "cons" or, for a preterminal, a "tok". Terminal nodes of phrase structures are "tok" tags, and, as with "cons", identifiers and syntactic categories are represented by the "id" and "cat" attributes. Additionally, the Penn Treebank-style POS tag and the base form of a word are given by the "pos" and "base" attributes.

Predicate argument structures are expressed by attributes of the "tok" tag: "pred" denotes the type of the predicate for a word. "arg1", ..., "arg4", and "mod" denote identifiers of "cons", and they represent argument or modification phrases/clauses of the predicate. For example, when the value of "pred" is "verb_arg1" and the value of "arg1" is "c1", the first argument (i.e., semantic subject) is a phrase or a clause which is assigned the identifier "c1".

The remaining tag, "sentence", carries additional information about the parsing result. A string processed by Enju as a sentence is enclosed in a "sentence" tag and is assigned an identifier (the "id" attribute). The "parse_status" attribute indicates whether parsing succeeded. Its value is "success" when parsing succeeded; otherwise it gives the reason for the failure, as listed below.

parse_status	reason of failure
empty line	The input line is an empty string
fragmental parse	Although the parser could not produce an analysis that spans the whole sentence, it outputs fragmental parse results
no successful parse	The parser could not produce an analysis
POS tagging error	Tokenization or POS tagging failed or returned an ill-formed string
lexical entry assignment error	Assignment of lexical entries failed (e.g. caused by wrong POS tags)
sentence length limit exceeded	The number of words exceeded the limit (raise the sentence length limit to parse such sentences)
edge number limit exceeded	The number of produced edges exceeded the limit (raise the edge number limit to parse such sentences)
parser setup error	Set-up of internal data structures failed (e.g., memory allocation failure)
XML encoding error	An error occurred while encoding a parse result into XML (contact the developer if you encounter this error)
unknown error	An error occurred for an unknown reason (contact the developer if you encounter this error)
fatal error	An unrecoverable error occurred (contact the developer if you encounter this error)

The "fom" attribute denotes a figure of merit, a score that indicates the goodness of the parsing result.
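As a minimal sketch of how a consumer might use "parse_status" and "fom", the following Python snippet keeps only the sentences whose status is "success" and reads their score as a number. The wrapping "root" element, the sample sentences, and the function name are assumptions made for illustration; they are not part of the Enju output format.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The <root> wrapper is an assumption so that the string
# is one well-formed XML document; constituent structure is omitted.
SAMPLE = """<root>
<sentence id="s0" parse_status="success" fom="7.28">John has come</sentence>
<sentence id="s1" parse_status="empty line"></sentence>
<sentence id="s2" parse_status="no successful parse">colorless green</sentence>
</root>"""

def successful_sentences(xml_text):
    """Return (id, fom) pairs for sentences whose parse_status is "success"."""
    root = ET.fromstring(xml_text)
    results = []
    for sentence in root.iter("sentence"):
        if sentence.get("parse_status") != "success":
            continue  # any other value is one of the failure reasons
        fom = sentence.get("fom")
        results.append((sentence.get("id"), float(fom) if fom is not None else None))
    return results

print(successful_sentences(SAMPLE))
```

Comparing against the literal string "success" keeps the check independent of the exact set of failure reasons.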

Here is an example of the XML output for the sentence "John has come":

<sentence id="s0" parse_status="success" fom="7.28">
  <cons id="c0" cat="S" xcat="" head="c3" sem_head="c3">
    <cons id="c1" cat="NP" xcat="" head="c2" sem_head="c2">
      <cons id="c2" cat="NX" xcat="" head="t0" sem_head="t0" >
        <tok id="t0" cat="N" pos="NNP" base="john" pred="noun_arg0">
          John
        </tok>
      </cons>
    </cons>
    <cons id="c3" cat="VP" xcat="" head="c4" sem_head="c5">
      <cons id="c4" cat="VX" xcat="" head="t1" sem_head="t1">
        <tok id="t1" cat="V" pos="VBZ" base="have" pred="aux_arg12" arg1="c1" arg2="c5">
          has
        </tok>
      </cons>
      <cons id="c5" cat="VP" xcat="" head="t2" sem_head="t2">
        <tok id="t2" cat="V" pos="VBN" base="come" pred="verb_arg1" arg1="c1">
          come
        </tok>
      </cons>
    </cons>
  </cons>
</sentence>

The linguistic meanings of syntactic categories and predicate-argument relations are explained in Sections 2 and 3.

Other linguistic features, such as applied schema, tense, and aspect, are output as attributes of the "cons" and "tok" tags, although they are omitted from the example above. These are explained in Section 4.
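The correspondence between the "tok" attributes and the relation tuples shown earlier can be sketched in Python as follows. This is an illustrative sketch, not a tool shipped with Enju: the function names are hypothetical, and the XML is the example above reduced to the attributes needed here.

```python
import xml.etree.ElementTree as ET

# The "John has come" example, reduced to "id", "pred", and argument attributes.
EXAMPLE = """<sentence id="s0" parse_status="success">
  <cons id="c0">
    <cons id="c1"><cons id="c2"><tok id="t0" pred="noun_arg0">John</tok></cons></cons>
    <cons id="c3">
      <cons id="c4"><tok id="t1" pred="aux_arg12" arg1="c1" arg2="c5">has</tok></cons>
      <cons id="c5"><tok id="t2" pred="verb_arg1" arg1="c1">come</tok></cons>
    </cons>
  </cons>
</sentence>"""

ARG_LABELS = ("arg1", "arg2", "arg3", "arg4", "mod")

def surface(node):
    """Return the words covered by a node, separated by single spaces."""
    return " ".join("".join(node.itertext()).split())

def predicate_argument_triples(sentence):
    """Collect (predicate word, label, argument text) for every "tok"."""
    by_id = {node.get("id"): node for node in sentence.iter() if node.get("id")}
    triples = []
    for tok in sentence.iter("tok"):
        for label in ARG_LABELS:
            target = tok.get(label)
            if target in by_id:  # skips absent or unresolvable references
                triples.append((surface(tok), label, surface(by_id[target])))
    return triples

for triple in predicate_argument_triples(ET.fromstring(EXAMPLE)):
    print(triple)
```

The triples come out in token order, so the two relations of "has" precede the relation of "come".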

2. Phrase structures

Phrase structures are represented by two tags, "cons" and "tok": "cons" expresses internal nodes, and "tok" denotes terminal nodes of tree structures. In general, "cons" corresponds to phrases and clauses, which are called "constituents", while "tok" corresponds to words and punctuation marks, which are called "tokens". All "cons" and "tok" tags are assigned unique identifiers, which are used to express various linguistic relations by identifying the targets of those relations. Attributes of these tags represent syntactic information about the annotated constituents.

Attributes of "cons"
id	identifier
cat	syntactic category
xcat	extra features of syntactic category
head	ID of the syntactic head daughter
sem_head	ID of the semantic head daughter

Attributes of "tok"
id	identifier
cat	syntactic category
pos	Penn Treebank-style part-of-speech tag
base	base form

A tree of syntactic constituents is expressed by nested "cons" tags. In the output of Enju, all trees have only binary or unary branching; that is, each "cons" tag immediately contains at most two "cons" tags.
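To illustrate the nesting, the Python sketch below renders a "cons"/"tok" tree in bracketed form and checks the branching constraint along the way. The fragment is a hypothetical, abbreviated input that keeps only "id" and "cat", and the function name is an assumption.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated fragment: only "id" and "cat" are kept.
FRAGMENT = """<cons id="c0" cat="S">
  <cons id="c1" cat="NP"><tok id="t0" cat="N">John</tok></cons>
  <cons id="c2" cat="VP"><tok id="t1" cat="V">came</tok></cons>
</cons>"""

def to_brackets(node):
    """Render a "cons"/"tok" tree as a bracketed string."""
    if node.tag == "tok":
        return (node.text or "").strip()
    daughters = list(node)
    # Unary or binary branching only: one or two daughters per "cons".
    if not 1 <= len(daughters) <= 2:
        raise ValueError(f"{node.get('id')}: expected 1 or 2 daughters, got {len(daughters)}")
    inner = " ".join(to_brackets(daughter) for daughter in daughters)
    return f"({node.get('cat')} {inner})"

print(to_brackets(ET.fromstring(FRAGMENT)))
```

For this fragment the function prints "(S (NP John) (VP came))".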

Syntactic categories of constituents are expressed by the "cat" attribute. A value of "cat" is a concatenation of a POS symbol (e.g., ADJ) and a suffix that indicates whether the constituent is a saturated phrase ("P") or an unsaturated constituent ("X"). The POS symbols are listed below.

ADJ	Adjective
ADV	Adverb
CONJ	Coordination conjunction
C	Complementizer
D	Determiner
N	Noun
P	Preposition
SC	Subordination conjunction
V	Verb

Additional symbols express other types of constituents. They are used as values of "cat" without suffixes.

COOD	Part of coordination
PN	Punctuation
PRT	Particle
S	Sentence
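The composition of "cat" values described above can be made concrete with a small Python helper. It is only a sketch: the function name is hypothetical, and the two symbol sets are copied from the lists in this section.

```python
# POS symbols that combine with a "P" or "X" suffix.
POS_SYMBOLS = {"ADJ", "ADV", "CONJ", "C", "D", "N", "P", "SC", "V"}
# Symbols that are used as "cat" values without a suffix.
BARE_SYMBOLS = {"COOD", "PN", "PRT", "S"}

def split_cat(cat):
    """Split a "cat" value into (symbol, suffix); suffix is None if absent."""
    if cat in BARE_SYMBOLS:
        return cat, None
    if cat[-1:] in ("P", "X") and cat[:-1] in POS_SYMBOLS:
        return cat[:-1], cat[-1]
    if cat in POS_SYMBOLS:  # value without a suffix, such as "N" or "P"
        return cat, None
    raise ValueError(f"unrecognized cat value: {cat!r}")

for value in ("NP", "NX", "ADJP", "PP", "P", "S", "COOD"):
    print(value, split_cat(value))
```

Checking the suffix-free symbols first avoids misreading "PN" or "S" as a POS symbol plus a suffix, and "PP" is correctly read as the preposition symbol "P" followed by the phrase suffix "P".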

Syntactic categories of tokens are also specified as values of "cat" of "tok" tags. The values are the same as above, except that suffixes are not used. The additional attributes "pos" and "base" represent morphological information: a Penn Treebank-style POS tag and the base form of the input string.

For example, "cat" of the "tok" for the word "John" is "N", because this word is a noun. The value of "pos" is "NNP", which means a singular proper noun. The word also forms a nominal constituent without a determiner, so the "cons" tag immediately above it has "NX" as its value of "cat". Furthermore, since this constituent can take an empty determiner to become a noun phrase, it is wrapped in another "cons" tag with "NP" as the value of "cat".

Since symbols of "cat" are sometimes very coarse, "xcat" expresses important linguistic distinctions. The value of "xcat" is a space-separated set of the following values.

COOD	Coordinated phrase/clause
IMP	Imperative sentence
INV	Subject-verb inversion
Q	Interrogative sentence with subject-verb inversion
REL	A relativizer included
FREL	A free relative included
TRACE	A trace included
WH	A wh-question word included

The values of "head" and "sem_head" are each the identifier of one of the constituent's daughters, namely its syntactic and semantic head daughter, respectively.

The syntactic head of a constituent is denoted by "head", and is the daughter constituent that determines the syntactic characteristics of the constituent. Usually, the syntactic head of an "X phrase" (X = verb, noun, adjective, etc.) is X. The syntactic head of a sentence is the main verb phrase.

The semantic head of a phrase is denoted by "sem_head", and is the daughter constituent that conveys the main semantic content of the constituent. That is, function words are not semantic heads even when they are syntactic heads, while content words are both syntactic and semantic heads. In the current implementation of Enju, "head" and "sem_head" differ in the following cases.

Auxiliary construction (including to-infinitive): "head" is the auxiliary verb, while "sem_head" is the main verb (see the example in Section 1).
Complementizer phrase (that, whether, for): "head" is the complementizer, while "sem_head" is the clause.
Quotation: "head" is the quotation mark, while "sem_head" is the content of the quotation.
Passive "by" phrase: "head" is "by", while "sem_head" is the object of the "by" phrase.
Dative "to" phrase: "head" is "to", while "sem_head" is the object of the "to" phrase.

In other cases, "head" is identical to "sem_head".
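The difference between the two attributes can be seen by following each one from the root of the Section 1 example down to a word. The Python sketch below does this; the function name is hypothetical, and the XML is the example reduced to the attributes needed here.

```python
import xml.etree.ElementTree as ET

# The Section 1 example, reduced to "id", "cat", "head", and "sem_head".
EXAMPLE = """<sentence id="s0">
  <cons id="c0" cat="S" head="c3" sem_head="c3">
    <cons id="c1" cat="NP" head="c2" sem_head="c2">
      <cons id="c2" cat="NX" head="t0" sem_head="t0"><tok id="t0">John</tok></cons>
    </cons>
    <cons id="c3" cat="VP" head="c4" sem_head="c5">
      <cons id="c4" cat="VX" head="t1" sem_head="t1"><tok id="t1">has</tok></cons>
      <cons id="c5" cat="VP" head="t2" sem_head="t2"><tok id="t2">come</tok></cons>
    </cons>
  </cons>
</sentence>"""

def lexical_head(sentence, cons_id, attribute="head"):
    """Follow "head" (or "sem_head") links from a constituent down to a word."""
    by_id = {node.get("id"): node for node in sentence.iter()}
    node = by_id[cons_id]
    while node.tag == "cons":
        node = by_id[node.get(attribute)]
    return (node.text or "").strip()

sentence = ET.fromstring(EXAMPLE)
print(lexical_head(sentence, "c0", "head"))      # syntactic head word
print(lexical_head(sentence, "c0", "sem_head"))  # semantic head word
```

Starting from "c0", the "head" chain ends at the auxiliary "has", while the "sem_head" chain ends at the main verb "come", which matches the auxiliary construction case listed above.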

3. Predicate argument structures

Predicate argument structures are expressed by attributes of "tok": "pred", "arg1", ..., "arg4", and "mod". "pred" denotes the type of a predicate. The others denote identifiers of constituents. For example, when the value of "pred" is "verb_arg1" and the value of "arg1" is "c1", the first argument (i.e., semantic subject) is a phrase or a clause identified by "c1".

A value of "pred" is one of the following values. It is a concatenation of a POS (e.g., noun, verb) and a symbol (or symbols) which indicates required arguments.

Attribute	Values	Description
pred	noun_arg0, noun_arg1, noun_arg2, noun_arg12, it_arg1, there_arg0, quote_arg2, quote_arg12, quote_arg23, quote_arg123, poss_arg2, poss_arg12, aux_arg12, aux_mod_arg12, verb_arg1, verb_arg12, verb_arg123, verb_arg1234, verb_mod_arg1, verb_mod_arg12, verb_mod_arg123, verb_mod_arg1234, adj_arg1, adj_arg12, adj_mod_arg1, adj_mod_arg12, conj_arg1, conj_arg12, conj_arg123, coord_arg12, det_arg1, prep_arg12, prep_arg123, prep_mod_arg12, prep_mod_arg123, lgs_arg2, dtv_arg2, punct_arg1, app_arg12, lparen_arg123, rparen_arg0, comp_arg1, comp_arg12, comp_mod_arg1, relative_arg1, relative_arg12	predicate type
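Since every "pred" value follows the same naming pattern, it can be decomposed mechanically. The Python sketch below is one way to do so. The function name is hypothetical, and reading the "_mod" part as a flag, and "arg0" as "no arguments", is an interpretation of the value names rather than something this document states.

```python
import re

# prefix, optional "_mod" marker, then "_arg" followed by argument digits
PRED_PATTERN = re.compile(r"^(?P<prefix>[a-z]+)(?P<mod>_mod)?_arg(?P<digits>[0-4]+)$")

def parse_pred(pred):
    """Split a "pred" value into (prefix, has_mod_marker, argument labels)."""
    match = PRED_PATTERN.match(pred)
    if match is None:
        raise ValueError(f"unrecognized pred value: {pred!r}")
    digits = match.group("digits")
    # Assumption: "arg0" means that the predicate takes no arguments.
    labels = [] if digits == "0" else [f"arg{digit}" for digit in digits]
    return match.group("prefix"), match.group("mod") is not None, labels

for value in ("noun_arg0", "verb_arg12", "verb_mod_arg123", "quote_arg23"):
    print(value, parse_pred(value))
```

The returned labels are the attribute names to look up on the same "tok" tag.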

Argument numbers ("X" in "argX") are assigned in the order of surface realization in declarative sentences. For nouns, verbs, adjectives, adverbs, and prepositions, "arg1" is assigned to the left argument, and "arg2", ..., "arg4" are assigned to the right arguments in left-to-right order. "mod" is assigned to the modifiee of a VP modifier (e.g., the matrix clause of a participial construction). For complementizers and determiners, the dependent phrase or clause is "arg1". The complement of "'s" (e.g., "John" in "John 's") is expressed as "arg2". For punctuation marks and particles, the dependent phrase is denoted by "arg1". For subordination and coordination conjunctions, the main or left conjunct is represented by "arg1", and the other conjunct is expressed as "arg2".

For example, "A beautiful butterfly is coming into my room." has the following predicate argument relations.

< coming, arg1, A beautiful butterfly >
< is, arg1, A beautiful butterfly >
< is, arg2, coming into my room >
< beautiful, arg1, butterfly >
< a, arg1, beautiful butterfly >
< into, arg1, coming >
< into, arg2, my room >
< my, arg1, room >

In the XML format, each relation is expressed by an attribute of "tok". For example, supposing that the identifier of the phrase "A beautiful butterfly" is "c1", the XML annotation for "coming" looks like this:

<tok id="t1" cat="V" pos="VBG" base="come" pred="verb_arg1" arg1="c1" >
  coming
</tok>

4. HPSG-specific features

Because the grammar of Enju contains rich linguistic information, part of it can also be output. The following attributes are added to "cons" or "tok" tags.

Attributes for "cons"
Attribute	Values	Description
schema	subj_head, head_comp, spec_head, head_mod, mod_head, filler_head, head_relative, coord_left, coord_right, empty_filler_head, empty_spec_head, free_relative	applied schema

Attributes for "tok"
Attribute	Values	Description
tense	untensed, past, present	tense of a verb
aspect	none, perfect, progressive, perfect-progressive	aspect of a verb
voice	active, passive	voice of a verb
aux	minus, modal, have, be, do, copular	auxiliary verb or not
type	pred, noun_mod, verb_mod, adj_mod, prep_mod, other_mod, pred_mod	syntactic type
lexentry	(see below)	assigned lexical entry

All "cons" tags except preterminals (i.e., "cons" tags immediately above "tok") have non-empty values for "schema". All verbs have an "aux" attribute, while principal verbs (i.e., aux="minus" or aux="copular") also have "tense", "aspect", and "voice", whose values are non-empty strings. The "type" attribute expresses the syntactic type of a word: "pred" means predicative; "noun_mod", "verb_mod", "adj_mod", "prep_mod", and "other_mod" mean modifiers of nouns, verbs, adjectives, prepositions, and other words, respectively; and "pred_mod" means a predicative modifier. If no "type" is assigned, the word is an argument or the head of the sentence.

All "tok" tags have "lexentry". A value of "lexentry" is a lexeme name and the applied lexical rules, concatenated by hyphens. For example, when the value of "lexentry" is "[NP.nom<V.bse>NP.acc]_lxm-singular3rd_verb_rule", the lexical entry is obtained from the lexeme "[NP.nom<V.bse>NP.acc]_lxm" by applying the rule "singular3rd_verb_rule".
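A "lexentry" value can be taken apart as sketched below in Python. The function name is hypothetical, and the sketch assumes that neither lexeme names nor rule names contain a hyphen themselves, which this document does not guarantee.

```python
def split_lexentry(lexentry):
    """Split a "lexentry" value into (lexeme name, list of applied rules).

    Assumption: hyphens occur only as separators, never inside a name.
    """
    lexeme, *rules = lexentry.split("-")
    return lexeme, rules

lexeme, rules = split_lexentry("[NP.nom<V.bse>NP.acc]_lxm-singular3rd_verb_rule")
print(lexeme)
print(rules)
```

A value with no applied rules yields an empty rule list.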

5. Appendix: Correspondences between Enju and PTB

Here is a rough sketch of the correspondences between the syntactic categories of Enju and those of the Penn Treebank (PTB). Note that this table does not imply that the output of Enju can be formally translated into PTB-style output. Because the output of Enju is based on HPSG, whose analyses differ from the annotation policy of PTB, tree structures and syntactic categories often differ from those of PTB-style annotation. Nevertheless, these mappings give a rough picture of what Enju expresses.

Enju cat	Enju xcat	PTB
ADJP		ADJP, QP (number expression)
ADJP	REL	WHADJP (relativizer)
ADJP	FREL	WHADJP (free relative)
ADJP	WH	WHADJP (wh-phrase)
ADVP		ADVP
ADVP	REL	WHADVP (relativizer)
ADVP	FREL	WHADVP (free relative)
ADVP	WH	WHADVP (wh-phrase)
CONJP		CONJP
CP		SBAR (complementizer phrase)
DP		NP (possessive), QP (quantifier)
NP		NP
NX		NX, NAC
NP	REL	WHNP (relativizer)
NP	FREL	WHNP (free relative)
NP	WH	WHNP (wh-phrase)
PP		PP
PP	REL	WHPP (relativizer)
PP	WH	WHPP (wh-phrase)
PRT		PRT
S		S
S	INV	SINV
S	Q	SQ
S	REL	SBAR (relative clause)
S	FREL	SBAR (free relative clause)
S	WH	SBARQ
SCP		SBAR (subordinate clause)
VP		VP, RRC
		PRN, INTJ, LST, X
