Corpus transformation

The ENJU grammar is developed by transforming the phrase structure trees of the Penn Treebank into HPSG-style phrase structure trees. This transformation is done with the treetrans tool of mayz. For more details on treetrans, please refer to the mayz manual.

treetrans <rule module> <input file> <output database>

  rule module      LiLFeS file that contains the rules for transforming phrase structure trees
  input file       input treebank (text format)
  output database  output treebank (lildb format)

In the case of the ENJU grammar, the input file is a text file in which each line contains a phrase structure tree of the Penn Treebank. A line of the input file looks as follows:

(S (NP-SBJ Ms./NNP Haag/NNP) (VP plays/VBZ (NP Elianti/NNP)) ./.)

treetrans reads this text and builds a phrase structure tree represented by feature structures, whose types are defined in the file "treetypes.lil" that comes with mayz. These feature structures are then transformed by the pattern rules in the rule module and supplemented with additional information.
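
To make the text format concrete, the following is a minimal Python sketch that reads one bracketed line with word/POS leaves into a nested tree. It is only an illustration of the notation, not how treetrans or mayz is implemented; the names Leaf, Node, parse_tree and leaves are hypothetical, and the real tool builds LiLFeS feature structures rather than Python objects.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Leaf:
    word: str
    pos: str


@dataclass
class Node:
    label: str
    children: List[Union["Node", Leaf]] = field(default_factory=list)


def parse_tree(line: str) -> Node:
    """Parse one bracketed tree with word/POS leaves (illustrative only)."""
    # Pad brackets with spaces so that split() isolates them as tokens.
    tokens = line.replace("(", " ( ").replace(")", " ) ").split()
    i = 0

    def parse_node() -> Node:
        nonlocal i
        if tokens[i] != "(":
            raise ValueError(f"expected '(' but found {tokens[i]!r}")
        node = Node(label=tokens[i + 1])
        i += 2
        while tokens[i] != ")":
            if tokens[i] == "(":
                node.children.append(parse_node())
            else:
                # Split at the last slash so that "./." yields word "." and tag ".".
                word, _, tag = tokens[i].rpartition("/")
                node.children.append(Leaf(word, tag))
                i += 1
        i += 1  # consume the closing bracket
        return node

    tree = parse_node()
    if i != len(tokens):
        raise ValueError("unexpected text after the end of the tree")
    return tree


def leaves(node: Node) -> List[Leaf]:
    """Collect the leaves of a tree from left to right."""
    found: List[Leaf] = []
    for child in node.children:
        if isinstance(child, Leaf):
            found.append(child)
        else:
            found.extend(leaves(child))
    return found


example = "(S (NP-SBJ Ms./NNP Haag/NNP) (VP plays/VBZ (NP Elianti/NNP)) ./.)"
tree = parse_tree(example)
words = [leaf.word for leaf in leaves(tree)]
```

In this sketch, each opening bracket starts a node whose first token is its label, and every other token is a leaf of the form word/POS.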

The processing of phrase structure trees is actually broken down into the following steps:

  1. Supply a Phrase Structure Tree as Input
  2. Pre-processing
  3. Transformation by Pattern Rules

Supply a Phrase Structure Tree as Input

First, a phrase structure tree supplied in text format is converted into a phrase structure tree represented as a feature structure. This is done by calling the predicate input_parse_tree/2, which is supplied with the input given as the "input file" argument above.

In the case of ENJU, the treebank used as input is the Penn Treebank, which can be handled by the input_ptb_parse_tree/2 predicate that comes with mayz. ENJU's definition of input_parse_tree/2 calls input_ptb_parse_tree/2 as a subgoal.

The input_ptb_parse_tree/2 predicate only converts the input tree into feature structures; it does not change the phrase structure of the tree. Leaf nodes, however, can be modified through the following predicates: ptb_empty_category/1, ptb_preprocess_pos/2, ptb_delete_pos/1, and ptb_preprocess_word/2. These predicates are applied as follows:

  1. POS tags specified by ptb_empty_category/1 are treated as empty categories and assigned the 'tree_empty' type.
  2. POS tags are pre-processed by the ptb_preprocess_pos/2 predicate.
  3. Pre-processed POS tags specified by the ptb_delete_pos/1 predicate are ignored. No feature structure is generated for tree nodes bearing such tags.
  4. Nodes that are not marked as ignored are pre-processed by the ptb_preprocess_word/2 predicate.
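
The order of the four steps above can be sketched in Python as follows. This is a hedged illustration, not the mayz implementation: the function preprocess_leaf, the LeafResult class, the "word" label, and all settings in HYPOTHETICAL_SETTINGS are made up for the example, and the sketch assumes that a leaf recognized as an empty category skips the remaining steps, which the text does not state. Only the 'tree_empty' type name and the step order come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Set


@dataclass
class LeafResult:
    kind: str  # "tree_empty" (from the text) or "word" (hypothetical label)
    word: str
    pos: str


def preprocess_leaf(
    word: str,
    pos: str,
    *,
    empty_tags: Set[str],
    preprocess_pos: Callable[[str], str],
    delete_tags: Set[str],
    preprocess_word: Callable[[str], str],
) -> Optional[LeafResult]:
    """Apply the four leaf steps in order; None means the leaf is dropped."""
    # Step 1: tags listed as empty categories yield a 'tree_empty' leaf.
    # Assumption: such leaves are not processed any further.
    if pos in empty_tags:
        return LeafResult("tree_empty", word, pos)
    # Step 2: pre-process the POS tag.
    new_pos = preprocess_pos(pos)
    # Step 3: leaves whose pre-processed tag is marked for deletion are ignored.
    if new_pos in delete_tags:
        return None
    # Step 4: pre-process the word of every leaf that was kept.
    return LeafResult("word", preprocess_word(word), new_pos)


# Hypothetical settings, chosen only to exercise each step. They are NOT the
# settings ENJU uses. "-NONE-" is the Penn Treebank tag for empty elements.
HYPOTHETICAL_SETTINGS = dict(
    empty_tags={"-NONE-"},
    preprocess_pos=lambda tag: {"PRP$": "PRP"}.get(tag, tag),
    delete_tags={"``", "''"},
    preprocess_word=str.lower,
)

trace = preprocess_leaf("*T*-1", "-NONE-", **HYPOTHETICAL_SETTINGS)
quote = preprocess_leaf("``", "``", **HYPOTHETICAL_SETTINGS)
name = preprocess_leaf("Haag", "NNP", **HYPOTHETICAL_SETTINGS)
```

Here each callable or set stands in for one of the four predicates, so swapping the settings changes the behavior without touching the step order.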

In the case of ENJU, the following is specified for a phrase structure tree supplied as the input: