Enju outputs phrase structures and predicate argument structures in an XML format. Phrase structures are tree structures that express how words are combined to form phrases and clauses. Predicate argument structures are a set of relations that describe semantic relations of words/phrases/clauses in a sentence.
For example, a phrase structure of the sentence "John has come." is described by the following tree structure:
(S (NP John) (VP has (VP come)))
where "S", "NP", and "VP" are syntactic categories of phrases. In phrase structures, internal nodes are syntactic categories, and terminal nodes are words.
A predicate argument structure of the same sentence is given as the following three relations.
< come, arg1, John >
< has, arg1, John >
< has, arg2, come >
Each tuple represents a labeled relation (a labeled dependency from a word to a phrase/clause). For example, the first tuple means that "arg1" (the first argument, i.e., the semantic subject) of "come" is "John". The first element of a tuple expresses a predicate word, the second element is a label of the relation, and the third element is an argument phrase/clause of the predicate. Because relations represented by predicate argument structures are relations between a predicate and its arguments in a predicate logic, they do not necessarily correspond to syntactic head/argument. For example, when a prepositional phrase modifies a noun, the preposition is represented as a predicate and the noun is denoted as the "arg1" of the preposition, while, in syntax, the noun is the head and the prepositional phrase is the non-head.
Enju represents both structures in XML with three tags as listed below.
tag | attributes | description |
---|---|---|
sentence | id parse_status fom | sentence |
cons | id cat xcat schema head sem_head | syntactic constituents (phrases, clauses, etc.) |
tok | id cat pos base tense aspect voice aux type lexentry pred arg1 arg2 arg3 arg4 mod | words, punctuations |
In general, "cons" represents phrase structures. Unique identifiers are assigned to every "cons" tag, and they are denoted by the "id" attribute. Syntactic categories of phrases and clauses, such as "NP" and "S", are expressed by the "cat" attribute, and finer classifications of categories are represented by the "xcat" attribute. Syntactic/semantic head daughters of internal nodes are expressed by the "head" and "sem_head" attributes, respectively. These attributes denote the identifier of "cons" of the head daughter. Terminal nodes of phrase structures are "tok" tags, and, like "cons", identifiers and syntactic categories are represented by the "id" and "cat" attributes. Additionally, Penn Treebank-style POS and a base form of a word are described by the "pos" and "base" attributes.
Predicate argument structures are expressed by attributes of the "tok" tag: "pred" denotes the type of the predicate for a word. "arg1", ..., "arg4", and "mod" denote identifiers of "cons", and they represent argument or modification phrases/clauses of the predicate. For example, when the value of "pred" is "verb_arg1" and the value of "arg1" is "c1", the first argument (i.e., semantic subject) is a phrase or a clause which is assigned the identifier "c1".
The remaining tag, "sentence", carries additional information about the parsing result. Each string that Enju processes as a sentence is enclosed in a "sentence" tag and is assigned an identifier (the "id" attribute). The "parse_status" attribute indicates whether parsing succeeded: its value is "success" when parsing succeeded, and otherwise it gives the reason for the failure, as listed below.
parse_status | reason of failure |
---|---|
empty line | The input line is an empty string |
fragmental parse | Although the parser could not produce an analysis that spans the whole sentence, it outputs fragmental parse results |
no successful parse | The parser could not produce an analysis |
POS tagging error | Tokenization or POS tagging failed or returned an ill-formed string |
lexical entry assignment error | Assignment of lexical entries failed (e.g. caused by wrong POS tags) |
sentence length limit exceeded | The number of words exceeded the limit (raise the sentence length limit to parse such sentences) |
edge number limit exceeded | The number of produced edges exceeded the limit (raise the edge number limit to parse such sentences) |
parser setup error | Set-up of internal data structures failed (e.g. failure of memory allocation) |
XML encoding error | An error occurred while encoding a parse result into XML (contact the developer when you find this error) |
unknown error | Error caused by an unknown reason (contact the developer when you find this error) |
fatal error | An unrecoverable error occurred (contact the developer when you find this error) |
The "fom" attribute denotes a figure-of-merit, which is a score of goodness of the parsing result.
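As a rough illustration of how a consumer of the output might read these attributes, the sketch below uses Python's standard `xml.etree.ElementTree`. The helper name `parse_outcome` and the two sample elements are made up for this example; in particular, whether Enju emits "fom" on a failed parse is not stated here, so the code treats the attribute as optional.

```python
import xml.etree.ElementTree as ET

def parse_outcome(sentence_xml):
    """Return (parse_status, fom) for one <sentence> element given as a string.

    fom is a float when the attribute is present, otherwise None
    (assumption: the attribute may be absent on failed parses).
    """
    sentence = ET.fromstring(sentence_xml)
    status = sentence.get("parse_status")
    fom = sentence.get("fom")
    return status, (float(fom) if fom is not None else None)

# Hypothetical sample elements, trimmed to the attributes of interest.
ok = '<sentence id="s0" parse_status="success" fom="7.28"></sentence>'
failed = '<sentence id="s1" parse_status="no successful parse"></sentence>'

print(parse_outcome(ok))      # ('success', 7.28)
print(parse_outcome(failed))  # ('no successful parse', None)
```

Any status other than "success" can then be compared against the table above to decide, for instance, whether to retry with larger limits.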
Here is an example of the XML output for the sentence "John has come":
<sentence id="s0" parse_status="success" fom="7.28">
  <cons id="c0" cat="S" xcat="" head="c3" sem_head="c3">
    <cons id="c1" cat="NP" xcat="" head="c2" sem_head="c2">
      <cons id="c2" cat="NX" xcat="" head="t0" sem_head="t0">
        <tok id="t0" cat="N" pos="NNP" base="john" pred="noun_arg0">John</tok>
      </cons>
    </cons>
    <cons id="c3" cat="VP" xcat="" head="c4" sem_head="c5">
      <cons id="c4" cat="VX" xcat="" head="t1" sem_head="t1">
        <tok id="t1" cat="V" pos="VBZ" base="have" pred="aux_arg12" arg1="c1" arg2="c5">has</tok>
      </cons>
      <cons id="c5" cat="VP" xcat="" head="t2" sem_head="t2">
        <tok id="t2" cat="V" pos="VBN" base="come" pred="verb_arg1" arg1="c1">come</tok>
      </cons>
    </cons>
  </cons>
</sentence>
The linguistic meanings of syntactic categories and predicate-argument relations are explained in Sections 2 and 3.
Other linguistic features, such as applied schema, tense, and aspect, are output as attributes of the "cons" and "tok" tags, although they are omitted from the example above. These are explained in Section 4.
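The following is a minimal sketch, not part of Enju itself, of how the three relations listed earlier for "John has come" could be recovered from this XML with Python's standard library. The embedded string is a copy of the example output laid out one element per line, with every "cons" closed explicitly; the helper names `surface` and `relations` are hypothetical.

```python
import xml.etree.ElementTree as ET

ENJU_XML = """
<sentence id="s0" parse_status="success" fom="7.28">
  <cons id="c0" cat="S" xcat="" head="c3" sem_head="c3">
    <cons id="c1" cat="NP" xcat="" head="c2" sem_head="c2">
      <cons id="c2" cat="NX" xcat="" head="t0" sem_head="t0">
        <tok id="t0" cat="N" pos="NNP" base="john" pred="noun_arg0">John</tok>
      </cons>
    </cons>
    <cons id="c3" cat="VP" xcat="" head="c4" sem_head="c5">
      <cons id="c4" cat="VX" xcat="" head="t1" sem_head="t1">
        <tok id="t1" cat="V" pos="VBZ" base="have" pred="aux_arg12" arg1="c1" arg2="c5">has</tok>
      </cons>
      <cons id="c5" cat="VP" xcat="" head="t2" sem_head="t2">
        <tok id="t2" cat="V" pos="VBN" base="come" pred="verb_arg1" arg1="c1">come</tok>
      </cons>
    </cons>
  </cons>
</sentence>
"""

ARG_LABELS = ("arg1", "arg2", "arg3", "arg4", "mod")

def surface(node):
    """Join the words of all tokens under a node, in document order."""
    return " ".join(t.text.strip() for t in node.iter("tok"))

def relations(sentence):
    """Yield (predicate word, label, argument text) for each tok attribute."""
    by_id = {n.get("id"): n for n in sentence.iter() if n.get("id")}
    for tok in sentence.iter("tok"):
        for label in ARG_LABELS:
            target = tok.get(label)
            if target in by_id:  # skips labels that are absent on this token
                yield (tok.text.strip(), label, surface(by_id[target]))

root = ET.fromstring(ENJU_XML)
for rel in relations(root):
    print(rel)
# ('has', 'arg1', 'John')
# ('has', 'arg2', 'come')
# ('come', 'arg1', 'John')
```

The relations come out in token order rather than in the order used in the prose, but the set is the same.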
Phrase structures are represented by two tags, "cons" and "tok": "cons" expresses internal nodes, and "tok" denotes terminal nodes of tree structures. In general, "cons" corresponds to phrases and clauses, which are called "constituents", while "tok" corresponds to words and punctuations, which are called "tokens". All "cons" and "tok" tags are assigned unique identifiers, which will be used to express various linguistic relations by identifying targets of the relations. Attributes of these tags represent syntactic information of annotated constituents.
Attributes of "cons":
id | identifier |
cat | syntactic category |
xcat | extra features of syntactic category |
head | ID of the syntactic head daughter |
sem_head | ID of the semantic head daughter |

Attributes of "tok":
id | identifier |
cat | syntactic category |
pos | Penn Treebank-style part-of-speech tag |
base | base form |
A tree of syntactic constituents is expressed by nested "cons" tags. In the output of Enju, all trees have only binary or unary branchings; that is, each "cons" tag directly contains at most two daughter tags.
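The branching constraint is easy to check mechanically. The sketch below is an illustration only: the function name `branching_violations` and both sample trees are invented, and the second tree is deliberately ill-formed to show what a violation looks like.

```python
import xml.etree.ElementTree as ET

def branching_violations(sentence):
    """Return ids of "cons" elements with zero or more than two daughters."""
    bad = []
    for cons in sentence.iter("cons"):
        # Daughters are the direct "cons" or "tok" children of this node.
        daughters = [child for child in cons if child.tag in ("cons", "tok")]
        if not 1 <= len(daughters) <= 2:
            bad.append(cons.get("id"))
    return bad

# Hypothetical inputs, reduced to ids only.
good = ET.fromstring(
    '<sentence id="s0"><cons id="c0">'
    '<cons id="c1"><tok id="t0">John</tok></cons>'
    '<cons id="c2"><tok id="t1">came</tok></cons>'
    '</cons></sentence>'
)
ternary = ET.fromstring(
    '<sentence id="s1"><cons id="c0">'
    '<cons id="c1"><tok id="t0">a</tok></cons>'
    '<cons id="c2"><tok id="t1">b</tok></cons>'
    '<cons id="c3"><tok id="t2">c</tok></cons>'
    '</cons></sentence>'
)

print(branching_violations(good))     # []
print(branching_violations(ternary))  # ['c0']
```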
Syntactic categories of constituents are expressed by the "cat" attribute. A value of "cat" is a concatenation of a POS symbol (e.g., ADJ) and a suffix which indicates whether a constituent is a saturated phrase (expressed by "P") or an unsaturated constituent ("X"). The POS symbols are listed below.
ADJ | Adjective |
ADV | Adverb |
CONJ | Coordination conjunction |
C | Complementizer |
D | Determiner |
N | Noun |
P | Preposition |
SC | Subordination conjunction |
V | Verb |
Additional symbols express other types of constituents. They are used as values of "cat" without suffixes.
COOD | Part of coordination |
PN | Punctuation |
PRT | Particle |
S | Sentence |
Syntactic categories of tokens are also specified as values of "cat" of "tok" tags. The values are the same as the above, while, in this case, suffixes are not expressed. Additional attributes, "pos" and "base", represent morphological information: a Penn Treebank-style POS tag and the base form of the input string.
For example, "cat" of the "tok" for the word "John" is "N", because this word is a noun. The value of "pos" is "NNP", which means a singular proper noun. The word also forms a nominal constituent without a determiner, so its "cons" tag is assigned "NX" as the value of "cat". Furthermore, since this constituent can take an empty determiner to become a noun phrase, it is also covered by another "cons" tag with "NP" as the value of "cat".
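The naming scheme for "cat" values can be decomposed programmatically. The sketch below assumes that the symbol lists given above are exhaustive; the function name `split_cat` is hypothetical and is not part of Enju.

```python
# Symbols that combine with a "P" or "X" suffix on "cons" tags.
POS_SYMBOLS = {"ADJ", "ADV", "CONJ", "C", "D", "N", "P", "SC", "V"}
# Symbols that are used as "cat" values without a suffix.
BARE_SYMBOLS = {"COOD", "PN", "PRT", "S"}

def split_cat(cat):
    """Split a "cat" value into (symbol, suffix).

    suffix is "P" (saturated phrase), "X" (unsaturated constituent),
    or None for bare symbols and for token-level categories.
    """
    if cat in BARE_SYMBOLS or cat in POS_SYMBOLS:
        return cat, None
    if cat[:-1] in POS_SYMBOLS and cat[-1] in ("P", "X"):
        return cat[:-1], cat[-1]
    raise ValueError(f"unrecognized cat value: {cat!r}")

for value in ("N", "NX", "NP", "PP", "SCP", "S"):
    print(value, split_cat(value))
```

Checking the whole value against the symbol sets first keeps "P" (a preposition token) and "S" (a sentence) from being misread as a symbol plus suffix.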
Since symbols of "cat" are sometimes very coarse, "xcat" expresses important linguistic distinctions. The value of "xcat" is a space-separated set of the following values.
COOD | Coordinated phrase/clause |
IMP | Imperative sentence |
INV | Subject-verb inversion |
Q | Interrogative sentence with subject-verb inversion |
REL | A relativizer included |
FREL | A free relative included |
TRACE | A trace included |
WH | A wh-question word included |
The value of "head" or "sem_head" is the identifier of one of the daughters of the constituent, namely its head daughter.
The syntactic head of a constituent is denoted by "head", and is a daughter constituent that determines syntactic characteristics of the constituent. Usually, the syntactic head of "X phrase" (X=verb, noun, adjective, etc.) is X. The syntactic head of a sentence is a main verb phrase.
The semantic head of a phrase is denoted by "sem_head", and is a daughter constituent that mainly conveys a semantic content of the constituent. That is, function words are not semantic heads even when they are syntactic heads, while content words are syntactic and semantic heads. Actually, in the current implementation of Enju, "head" and "sem_head" are different in the following cases.
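In the "John has come" example, the two attributes differ at "c3": its "head" is "c4" (the auxiliary "has"), while its "sem_head" is "c5" ("come"). The sketch below follows each attribute from the root down to a token to make this visible. The XML is a copy of that example trimmed to the "id", "head", and "sem_head" attributes, and the helper `lexical_head` is a hypothetical name.

```python
import xml.etree.ElementTree as ET

TRIMMED = """
<sentence id="s0">
  <cons id="c0" head="c3" sem_head="c3">
    <cons id="c1" head="c2" sem_head="c2">
      <cons id="c2" head="t0" sem_head="t0"><tok id="t0">John</tok></cons>
    </cons>
    <cons id="c3" head="c4" sem_head="c5">
      <cons id="c4" head="t1" sem_head="t1"><tok id="t1">has</tok></cons>
      <cons id="c5" head="t2" sem_head="t2"><tok id="t2">come</tok></cons>
    </cons>
  </cons>
</sentence>
"""

root = ET.fromstring(TRIMMED)
by_id = {node.get("id"): node for node in root.iter()}

def lexical_head(node_id, attr):
    """Follow `attr` ("head" or "sem_head") from a node down to a "tok"."""
    node = by_id[node_id]
    while node.tag != "tok":
        node = by_id[node.get(attr)]
    return node

print(lexical_head("c0", "head").text)      # has
print(lexical_head("c0", "sem_head").text)  # come
```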
Predicate argument structures are expressed by attributes of "tok": "pred", "arg1", ..., "arg4", and "mod". "pred" denotes the type of a predicate. The others denote identifiers of constituents. For example, when the value of "pred" is "verb_arg1" and the value of "arg1" is "c1", the first argument (i.e., semantic subject) is a phrase or a clause identified by "c1".
A value of "pred" is one of the following values. It is a concatenation of a POS (e.g., noun, verb) and a symbol (or symbols) which indicates required arguments.
Attribute | Values | Description |
---|---|---|
pred | noun_arg0, noun_arg1, noun_arg2, noun_arg12, it_arg1, there_arg0, quote_arg2, quote_arg12, quote_arg23, quote_arg123, poss_arg2, poss_arg12, aux_arg12, aux_mod_arg12, verb_arg1, verb_arg12, verb_arg123, verb_arg1234, verb_mod_arg1, verb_mod_arg12, verb_mod_arg123, verb_mod_arg1234, adj_arg1, adj_arg12, adj_mod_arg1, adj_mod_arg12, conj_arg1, conj_arg12, conj_arg123, coord_arg12, det_arg1, prep_arg12, prep_arg123, prep_mod_arg12, prep_mod_arg123, lgs_arg2, dtv_arg2, punct_arg1, app_arg12, lparen_arg123, rparen_arg0, comp_arg1, comp_arg12, comp_mod_arg1, relative_arg1, relative_arg12 | predicate type |
Argument numbers ("X" in "argX") are assigned in the order of surface realizations in declarative sentences. For nouns, verbs, adjectives, adverbs, and prepositions, "arg1" is assigned to a left argument, and "arg2", ..., "arg4" are assigned to right arguments in left-to-right order. "mod" is assigned to the modifiee of VP modifiers (e.g., the matrix clause of a participial construction). For complementizers and determiners, their dependent phrases/clauses are "arg1". The complement of "'s" (e.g. "John" in "John 's") is expressed as "arg2". For punctuation and particles, their dependent phrases are denoted by "arg1". For subordination/coordination conjunctions, main/left conjuncts are represented by "arg1", and the other conjuncts are expressed as "arg2".
For example, "A beautiful butterfly is coming into my room." has the following predicate argument relations.
< coming, arg1, A beautiful butterfly >
< is, arg1, A beautiful butterfly >
< is, arg2, coming into my room >
< beautiful, arg1, butterfly >
< a, arg1, beautiful butterfly >
< into, arg1, coming >
< into, arg2, my room >
< my, arg1, room >
In the XML format, each relation is expressed by an attribute of "tok". For example, when we suppose the identifier of the phrase "A beautiful butterfly" is "c1", the XML annotation for "coming" will be like this:
<tok id="t1" cat="V" pos="VBG" base="come" pred="verb_arg1" arg1="c1" > coming </tok>
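Since every "pred" value listed above follows the same naming pattern, it can be taken apart with a regular expression. The sketch below is illustrative only: `parse_pred` is a hypothetical helper, and the meaning of the "_mod" infix is not spelled out in this section, so the code merely reports whether it is present rather than interpreting it.

```python
import re

# <pos>[_mod]_arg<digits>, e.g. "verb_arg12" or "prep_mod_arg123".
PRED_PATTERN = re.compile(r"^(?P<pos>[a-z]+)(?P<mod>_mod)?_arg(?P<args>[0-4]+)$")

def parse_pred(pred):
    """Split a "pred" value into (pos, has_mod_infix, argument labels)."""
    match = PRED_PATTERN.match(pred)
    if match is None:
        raise ValueError(f"unrecognized pred value: {pred!r}")
    digits = match.group("args")
    # "arg0" marks a predicate that takes no arguments.
    labels = [] if digits == "0" else [f"arg{d}" for d in digits]
    return match.group("pos"), match.group("mod") is not None, labels

print(parse_pred("verb_arg1"))       # ('verb', False, ['arg1'])
print(parse_pred("prep_mod_arg12"))  # ('prep', True, ['arg1', 'arg2'])
print(parse_pred("noun_arg0"))       # ('noun', False, [])
print(parse_pred("quote_arg23"))     # ('quote', False, ['arg2', 'arg3'])
```

The returned labels name the "tok" attributes one would expect to find on a token with that predicate type.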
Because the grammar of Enju encodes rich linguistic information, part of that information can also be output. The following attributes are added to "cons" or "tok" tags.
Attribute | Values | Description |
---|---|---|
schema | subj_head, head_comp, spec_head, head_mod, mod_head, filler_head, head_relative, coord_left, coord_right, empty_filler_head, empty_spec_head, free_relative | applied schema |
tense | untensed, past, present | tense of a verb |
aspect | none, perfect, progressive, perfect-progressive | aspect of a verb |
voice | active, passive | voice of a verb |
aux | minus, modal, have, be, do, copular | auxiliary verb or not |
type | pred, noun_mod, verb_mod, adj_mod, prep_mod, other_mod, pred_mod | syntactic type |
lexentry | (see below) | assigned lexical entry |
All "cons" tags except preterminals (i.e., "cons" tags immediately above "tok") have non-empty values for "schema". All verbs have "aux" attributes, while principal verbs (i.e., aux="minus" or aux="copular") have "tense", "aspect", and "voice", whose values are non-empty strings. The "type" attribute expresses the syntactic type of a word: "pred" means predicative, "noun_mod", "verb_mod", "adj_mod", "prep_mod", and "other_mod" mean modifiers to nouns/verbs/adjectives/prepositions/other words, respectively, and "pred_mod" means a predicative modifier. If no "type" is assigned, it indicates that the word is an argument or the head of the sentence.
All "tok" tags have "lexentry". A value of "lexentry" is a lexeme name followed by the applied lexical rules, joined by hyphens. For example, when the value of "lexentry" is "[NP.nom<V.bse>NP.acc]_lxm-singular3rd_verb_rule", the lexical entry is obtained from the lexeme "[NP.nom<V.bse>NP.acc]_lxm" by applying the rule "singular3rd_verb_rule".
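A minimal sketch of splitting a "lexentry" value is shown below. It assumes that neither lexeme names nor rule names contain a hyphen themselves, which holds for the example above but is not guaranteed by this description; `split_lexentry` is a hypothetical helper name.

```python
def split_lexentry(lexentry):
    """Split a "lexentry" value into (lexeme name, list of applied rules).

    Assumption: hyphens occur only as separators, never inside a name.
    """
    lexeme, *rules = lexentry.split("-")
    return lexeme, rules

entry = "[NP.nom<V.bse>NP.acc]_lxm-singular3rd_verb_rule"
print(split_lexentry(entry))
# ('[NP.nom<V.bse>NP.acc]_lxm', ['singular3rd_verb_rule'])
```

A value with no hyphen yields an empty rule list, which corresponds to a lexeme used without any lexical rule applied.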
Here is a rough sketch of correspondences of syntactic categories of Enju and Penn Treebank (PTB). It should be noted that this table does not necessarily mean that outputs of Enju can be formally translated into PTB-style outputs. Because the output of Enju is based on HPSG and it is different from the annotation policy of PTB, tree structures and/or syntactic categories are often different from those given by the PTB-style annotation. However, these mappings provide a clear image of what Enju expresses.
Enju cat | Enju xcat | PTB |
---|---|---|
ADJP | | ADJP, QP (number expression) |
ADJP | REL | WHADJP (relativizer) |
ADJP | FREL | WHADJP (free relative) |
ADJP | WH | WHADJP (wh-phrase) |
ADVP | | ADVP |
ADVP | REL | WHADVP (relativizer) |
ADVP | FREL | WHADVP (free relative) |
ADVP | WH | WHADVP (wh-phrase) |
CONJP | | CONJP |
CP | | SBAR (complementizer phrase) |
DP | | NP (possessive), QP (quantifier) |
NP | | NP |
NX | | NX, NAC |
NP | REL | WHNP (relativizer) |
NP | FREL | WHNP (free relative) |
NP | WH | WHNP (wh-phrase) |
PP | | PP |
PP | REL | WHPP (relativizer) |
PP | WH | WHPP (wh-phrase) |
PRT | | PRT |
S | | S |
S | INV | SINV |
S | Q | SQ |
S | REL | SBAR (relative clause) |
S | FREL | SBAR (free relative clause) |
S | WH | SBARQ |
SCP | | SBAR (subordinate clause) |
VP | | VP, RRC |
 | | PRN, INTJ, LST, X |
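The table above can be encoded as a lookup, as in the sketch below. As the text cautions, this is only a rough correspondence and not a formal translation into PTB-style output. The function name `rough_ptb_labels` is hypothetical, and the rule that the first matching "xcat" value wins when several are present is an assumption made for this example, not documented behaviour.

```python
# Rows of the table without an xcat value: cat -> candidate PTB labels.
PLAIN = {
    "ADJP": ("ADJP", "QP"), "ADVP": ("ADVP",), "CONJP": ("CONJP",),
    "CP": ("SBAR",), "DP": ("NP", "QP"), "NP": ("NP",), "NX": ("NX", "NAC"),
    "PP": ("PP",), "PRT": ("PRT",), "S": ("S",), "SCP": ("SBAR",),
    "VP": ("VP", "RRC"),
}

# Rows of the table with an xcat value: (cat, xcat) -> candidate PTB labels.
WITH_XCAT = {
    ("ADJP", "REL"): ("WHADJP",), ("ADJP", "FREL"): ("WHADJP",),
    ("ADJP", "WH"): ("WHADJP",),
    ("ADVP", "REL"): ("WHADVP",), ("ADVP", "FREL"): ("WHADVP",),
    ("ADVP", "WH"): ("WHADVP",),
    ("NP", "REL"): ("WHNP",), ("NP", "FREL"): ("WHNP",), ("NP", "WH"): ("WHNP",),
    ("PP", "REL"): ("WHPP",), ("PP", "WH"): ("WHPP",),
    ("S", "INV"): ("SINV",), ("S", "Q"): ("SQ",), ("S", "REL"): ("SBAR",),
    ("S", "FREL"): ("SBAR",), ("S", "WH"): ("SBARQ",),
}

def rough_ptb_labels(cat, xcat=""):
    """Return candidate PTB labels for an Enju (cat, xcat) pair.

    xcat is the space-separated attribute value; an empty tuple means
    the table lists no counterpart for this category.
    """
    for flag in xcat.split():
        if (cat, flag) in WITH_XCAT:
            return WITH_XCAT[(cat, flag)]  # assumption: first match wins
    return PLAIN.get(cat, ())

print(rough_ptb_labels("NP", "WH"))  # ('WHNP',)
print(rough_ptb_labels("S", "Q"))    # ('SQ',)
print(rough_ptb_labels("VP"))        # ('VP', 'RRC')
```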