Corpus transformation

The grammar of ENJU is developed by transforming the phrase structure trees of the Penn Treebank into HPSG-style phrase structure trees. This transformation is done with the treetrans tool of mayz. For more details on treetrans, please refer to the mayz manual.

treetrans rule module input file output database
rule module	LiLFeS file that contains the rules for transforming phrase structure trees
input file	input treebank (text format)
output database	output treebank (lildb format)

For the ENJU grammar, the input file is a text file in which each line contains one phrase structure tree of the Penn Treebank. A line of the input file looks like this:

(S (NP-SBJ Ms./NNP Haag/NNP) (VP plays/VBZ (NP Elianti/NNP)) ./.)

treetrans reads this text and builds a phrase structure tree represented by feature structures of the types defined in the file "treetypes.lil", which comes with mayz. The feature structures are then further transformed by the pattern rules in the rule module and supplemented with additional information.
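To make the input format concrete, here is a minimal Python sketch that reads one bracketed line into a nested structure. It is only an illustration of the text format: the function name parse_tree and the tuple representation are assumptions made for this example, and it does not reproduce the feature structures of "treetypes.lil" or any mayz API.

```python
def parse_tree(text):
    """Parse one bracketed line.

    Nonterminals become (label, [children]); leaves become (word, pos).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "a subtree must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                # Leaves are written word/POS; split at the last slash.
                word, _, tag = tokens[pos].rpartition("/")
                children.append((word, tag))
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


line = "(S (NP-SBJ Ms./NNP Haag/NNP) (VP plays/VBZ (NP Elianti/NNP)) ./.)"
tree = parse_tree(line)
```

Splitting each leaf at its last slash keeps tokens such as "./." intact: the word is "." and the POS tag is ".".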

The processing of phrase structure trees is actually broken down into the following steps:

Supply a Phrase Structure Tree as Input
Preprocessing
Transformation by Pattern Rules

Supply a Phrase Structure Tree as Input

A phrase structure tree represented as a feature structure is constructed from a phrase structure tree supplied in text format. To do this, the predicate input_parse_tree/2 is called with one line of the input file.

In the case of ENJU, the treebank used as input is the Penn Treebank, which can be handled by the input_ptb_parse_tree/2 predicate that comes with mayz. ENJU's definition of input_parse_tree/2 calls input_ptb_parse_tree/2 in its body.

The input_ptb_parse_tree/2 predicate only converts the input tree to feature structures; it does not change the phrase structure of the tree. Leaf nodes can be changed with the following predicates: ptb_empty_category/1, ptb_preprocess_pos/2, ptb_delete_pos/1, ptb_preprocess_word/2. These predicates work as follows:

Leaves whose POS tags are specified by ptb_empty_category/1 are treated as empty categories and assigned the 'tree_empty' type.
POS tags are preprocessed by the ptb_preprocess_pos/2 predicate.
Leaves whose preprocessed POS tags are specified by the ptb_delete_pos/1 predicate are ignored. No feature structure is generated for tree nodes bearing such tags.
The words of nodes that are not ignored are preprocessed by the ptb_preprocess_word/2 predicate.

In the case of ENJU, the following is specified for a phrase structure tree supplied as input:

The POS tag "-NONE-" corresponds to the empty category.
The preprocessing of tags during grammar extraction is the same as the preprocessing of tags during parsing (e.g. the POS tag "." is changed to "-PERIOD-").
The tags ignored during grammar extraction are the same as the tags ignored during parsing (e.g. the "-PERIOD-" tag is ignored).
The preprocessing of words during grammar extraction is the same as the preprocessing of the input string during parsing (e.g. "-YEAR-" is assigned to any 4-digit number).
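The leaf handling described above can be sketched in Python as follows. This is a hedged illustration, not ENJU's implementation: the tables and the function process_leaf are hypothetical stand-ins for the roles of the four ptb_* predicates, the example values are only the ones mentioned in the text, and the order of the checks is assumed to follow the order in which the predicates are listed.

```python
import re

# Hypothetical stand-ins for the interface predicates (names are assumptions).
EMPTY_CATEGORY_TAGS = {"-NONE-"}   # role of ptb_empty_category/1
POS_REWRITES = {".": "-PERIOD-"}   # role of ptb_preprocess_pos/2
DELETED_TAGS = {"-PERIOD-"}        # role of ptb_delete_pos/1


def preprocess_word(word):
    """Role of ptb_preprocess_word/2: any 4-digit number becomes "-YEAR-"."""
    return "-YEAR-" if re.fullmatch(r"\d{4}", word) else word


def process_leaf(word, tag):
    """Return a processed leaf, or None when the leaf is ignored."""
    if tag in EMPTY_CATEGORY_TAGS:
        return ("tree_empty", word, tag)
    tag = POS_REWRITES.get(tag, tag)       # preprocess the POS tag first
    if tag in DELETED_TAGS:                # deletion looks at the preprocessed tag
        return None
    return ("leaf", preprocess_word(word), tag)
```

For instance, process_leaf(".", ".") returns None because "." is first rewritten to "-PERIOD-" and that tag is then ignored, while process_leaf("1987", "CD") yields a leaf whose word is "-YEAR-".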

Preprocessing

During preprocessing, phrase structure trees are reshaped before applying pattern rules.

Following a breadth-first approach, it does the following with each node of a phrase structure tree:

If a subtree (of the type 'tree') below the node is unifiable with a feature structure specified by the delete_tree/1 predicate, that subtree is deleted. Deleted subtrees do not undergo the following processes.
If the label of a non-terminal node is specified by the nonterminal_mapping/2 predicate, the label is converted as specified.
In the case of leaves:

If a conversion of the input word and its assigned preterminal symbol is specified by the preterminal_mapping/4 predicate, the conversion is carried out.
If no conversion has been done by the preterminal_mapping/4 predicate and nonterminal symbols are specified for the preterminal symbol of the current node by the corresponding interface predicate, nodes with the specified nonterminal symbols are added above the current node.

delete_tree/1, nonterminal_mapping/2 and preterminal_mapping/4 are interface predicates.

They specify the following things:

which subtrees are to be deleted
conversion of nonterminal symbols (e.g. from "NAC" to "NP")
conversion of preterminal symbols (e.g. from "%/NN" to "%/%")

These are specified in the following manner:

nonterminal_mapping("NAC", "NP").
preterminal_mapping("%", "NN", "%", "%").

In ENJU, these interface predicates appear in the file "devel/transmain.lil".
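The preprocessing pass can be pictured with the following Python sketch. It is an approximation under stated assumptions: trees are plain dictionaries rather than feature structures, the two tables mirror only the nonterminal_mapping and preterminal_mapping examples given above, the should_delete callback is a hypothetical stand-in for unification against delete_tree/1, and the step that adds nonterminal nodes above a leaf is left out.

```python
from collections import deque

# Example mappings taken from the text; the table names are assumptions.
NONTERMINAL_MAPPING = {"NAC": "NP"}
PRETERMINAL_MAPPING = {("%", "NN"): ("%", "%")}


def preprocess(tree, should_delete=lambda subtree: False):
    """Breadth-first preprocessing sketch (modifies the tree in place).

    Nonterminals: {"label": str, "children": list}
    Leaves:       {"word": str, "pos": str}
    """
    queue = deque([tree])
    while queue:
        node = queue.popleft()
        if "children" not in node:
            # Leaf: convert the (word, preterminal) pair when a mapping exists.
            key = (node["word"], node["pos"])
            node["word"], node["pos"] = PRETERMINAL_MAPPING.get(key, key)
            continue
        # Delete matching subtrees first; deleted subtrees are never visited.
        node["children"] = [c for c in node["children"] if not should_delete(c)]
        node["label"] = NONTERMINAL_MAPPING.get(node["label"], node["label"])
        queue.extend(node["children"])
    return tree
```

Because deleted subtrees are removed before the children are queued, they never reach the label or leaf conversion steps, which matches the description that deleted subtrees do not undergo the later processes.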

Transformation by Pattern Rules

Pattern rules are applied to preprocessed phrase structure trees. Transformed trees are stored in the output database.

The objective is to construct HPSG-style phrase structure trees from the preprocessed phrase structure trees. The pattern rules include operations such as the following:

Fixing Errors in the Penn Treebank: For example, changing (PP ***/RP XXX) to (PP ***/IN XXX)
Specifying the Structure of the "than"-construction: That is, changing (... than/IN XXX) to (... (PP than/IN XXX:argument))
Determining the Head of all Non-terminal Nodes
Converting to Binary Trees Based on Headed Structures
Applying Syntactic Rules to Nodes: For example, applying the Head-Subject schema to (X Y:arg Z:head)

Pattern rules used for transformation are defined by the following interface predicates: tree_transform_class/3, tree_ignore/2, tree_transform_rule/3, tree_subst_pattern/3, tree_unify/2, tree_match_pattern/2. The first thing to do is to declare the pattern rules. The name of a rule, the order of rule application and the operation after rule application are specified by the tree_transform_class/3 predicate. Pattern rules are applied in the order in which they are declared.

`tree_transform_class(+$Name, +$Direction, +$Strict)`
`+$Name`	Name of the pattern rule
`+$Direction`	Order of rule application
	"topdown": from the node at the top to the nodes at the bottom
	"bottomup": from the nodes at the bottom to the node at the top
	"rootonly": only to the root node
`+$Strict`	Operation after rule application
	"strict": if application fails, fail the transformation
	"weak": if application fails, carry on with the next rule
	"exhaustive": if application succeeds, reapply the rule until it fails
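One possible reading of the `+$Direction` and `+$Strict` values is sketched below in Python. This is an interpretation for illustration only, not the mayz implementation: the names apply_class, apply_at and TransformFailed are hypothetical, a rule is modeled as a function that returns a new node or None on failure, and it is assumed that failure is decided node by node and that "exhaustive" treats a first failure like "weak".

```python
class TransformFailed(Exception):
    """Raised when a "strict" rule fails (hypothetical name)."""


def apply_at(node, rule, strict):
    """Apply one rule at one node according to the strictness setting."""
    result = rule(node)
    if result is None:
        if strict == "strict":
            raise TransformFailed(node.get("label"))
        return node  # "weak" / "exhaustive": leave the node unchanged
    if strict == "exhaustive":
        # Reapply until the rule fails; the rule must eventually fail.
        while True:
            again = rule(result)
            if again is None:
                break
            result = again
    return result


def apply_class(node, rule, direction, strict):
    """Visit nodes in the order requested by direction."""
    if direction == "rootonly":
        return apply_at(node, rule, strict)
    if direction == "topdown":
        node = apply_at(node, rule, strict)
    if "children" in node:
        node = dict(node, children=[apply_class(c, rule, direction, strict)
                                    for c in node["children"]])
    if direction == "bottomup":
        node = apply_at(node, rule, strict)
    return node


def rename_nac(node):
    """Toy rule: relabel NAC as NP; fail (None) on every other node."""
    return dict(node, label="NP") if node.get("label") == "NAC" else None


example = {"label": "S", "children": [{"label": "NAC", "children": []}]}
weak_result = apply_class(example, rename_nac, "topdown", "weak")
```

With "topdown" and "weak", the toy rule fails at the root S, is skipped there, and then succeeds on the NAC daughter. With "rootonly", the daughter is never visited, and with "strict" the failure at S aborts the whole transformation in this sketch.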

The kind of processing done by a pattern rule is specified by the following predicates: tree_ignore/2, tree_transform_rule/3, tree_subst_pattern/3, tree_unify/2, tree_match_pattern/2. A rule is processed with these predicates as follows:

1. If the tree under processing is unifiable with the partial tree specified by the tree_ignore/2 predicate, delete the tree under processing.
2. If the tree under processing satisfies the constraints given by the tree_transform_rule/3 predicate, substitute it with the partial tree returned by tree_transform_rule/3.
3. If the tree under processing matches the pattern given by the tree_subst_pattern/3 predicate, substitute it with the pattern returned by tree_subst_pattern/3.
4. Unify the tree under processing with the partial tree given by the tree_unify/2 predicate.
5. Match the tree under processing with the pattern given by the tree_match_pattern/2 predicate.

The steps are tried in this order. As soon as one of them succeeds, ENJU regards the application of the rule as successful and the remaining steps are not executed (for example, if step 1 succeeds, step 2 is not executed). If every step up to the last one fails, the application of the pattern rule is regarded as failed.

ENJU defines pattern rules in the following manner. This rule specifies the structure of a "than" phrase: it is a pattern rule that transforms the bracketing (... than/IN XXX) into (... (PP than/IN XXX:argument)).

tree_transform_class("than", "topdown", "weak").
tree_subst_pattern("than",
                   TREE_NODE\$Node & TREE_DTRS\$Dtrs,
                   TREE_NODE\$Node & TREE_DTRS\$NewDtrs) :-
    $Dtrs = [$Left & tree_any & ANY_TREES\[_|_],
             $Than & tree & TREE_NODE\(SYM\"IN" & WORD\SURFACE\"than"),
             $Right & tree & TREE_NODE\HEAD_MARK\argument],
    $NewDtrs = [$Left,
                TREE_NODE\(SYM\"PP" & FUNC\[] & ID\[] & HEAD_MARK\modifier_non_empty) &
                TREE_DTRS\[$Than, $Right]].

After the transformation is complete, check that the correct value is assigned to the TREE_NODE\SCHEMA_NAME\ feature of each node in the resulting phrase structure tree. This feature gives the name of the schema applied to the daughters of the current node. This information is extracted in the next step.
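The effect of the "than" rule on a list of daughters can be mimicked with a short Python sketch. It is a loose analogy rather than a translation of the LiLFeS rule: daughters are plain dictionaries, simple equality checks stand in for feature-structure matching, and the function name restructure_than and the dictionary keys are assumptions made for this example.

```python
def restructure_than(daughters):
    """Group a trailing "than/IN XXX:argument" pair under a new PP node.

    Returns the new daughter list, or None when the pattern does not match.
    """
    # The pattern needs at least one daughter to the left of "than".
    if len(daughters) < 3:
        return None
    *left, than, right = daughters
    if than.get("pos") != "IN" or than.get("word") != "than":
        return None
    if right.get("head_mark") != "argument":
        return None
    pp = {
        "label": "PP",
        "head_mark": "modifier_non_empty",
        "children": [than, right],
    }
    return left + [pp]


daughters = [
    {"word": "more", "pos": "JJR"},
    {"word": "than", "pos": "IN"},
    {"label": "NP", "head_mark": "argument", "children": []},
]
new_daughters = restructure_than(daughters)
```

Returning None for a non-matching daughter list corresponds to the rule failing at that node; since the rule class is declared "weak", such a failure would simply let processing continue with the next rule.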


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)