Preprocessing

Preprocessing is handled by the sentence_to_word_lattice/2 predicate in the file "grammar/grammar.lil". Escaping of special characters is handled by "grammar/preproc.lil", and stemming is handled by "grammar/stem.lil". You can customize the preprocessing by modifying "grammar/grammar.lil".

Preprocessing is done in the following order:

  1. Tagging
  2. Segmentation
  3. Building extents from words

Each of these steps is explained below.

Tagging

To use a tagger, the external_tagger/2 predicate included in the MAYZ toolkit has to succeed. Use the initialize_external_tagger/2 predicate to start an external tagger, then call external_tagger/2. The first argument is passed to the external tagger as input, and the second argument returns the output of the external tagger.

The tagger is initialized by initialize_external_tagger/2, which is found in the file "grammar/grammar.lil". If the environment variable "ENJU_TAGGER" is set, the tagger specified by its value is run. If "ENJU_TAGGER" is not set, an external tagger cannot be used.
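
The lookup described above can be sketched in Python as follows. This only illustrates the decision logic and is not the LiLFeS code in "grammar/grammar.lil"; the function name resolve_tagger and the use of None to mean "no external tagger" are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_tagger(env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the tagger command named by ENJU_TAGGER, or None if it is unset.

    Hypothetical helper: None stands for "an external tagger cannot be used".
    """
    return env.get("ENJU_TAGGER")
```

Passing the environment as an argument keeps the sketch easy to try with a plain dictionary instead of the real process environment.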

The environment variable "ENJU_TAGGER" is set by the enju command ("parser/enju.cc") before up is initialized. By default, it is set to "uptagger". Users can change the value assigned to "ENJU_TAGGER". The tagger can also be specified by running Enju with the -t option.

The output of the tagger will be stored in the array named '*enju_tagged_sentence*'.

Segmentation

The output of the tagger (the second argument of external_tagger/2) is assumed to be a string of words separated by space characters. The output string is segmented into words at single space characters.

A single space is taken as a separator. Multiple spaces are taken as the space character itself.

Building extents from words

All words are now in "WORD/POS" format, where POS is a part-of-speech label from the Penn Treebank. The sentence_to_word_lattice/2 predicate turns them into a list of 'extent's (defined in "mayz/parser.lil"). This list is the second argument of sentence_to_word_lattice/2. To be more precise, it is the token_to_word_lattice clause under sentence_to_word_lattice/2 that generates the actual list.

token_to_word_lattice processes each word in the following order:

  1. Extract the word and its corresponding POS.
  2. Calculate the position of the word in the input sentence.
  3. Escape special characters in the word or its corresponding POS label (the actual processing is done by "grammar/preproc.lil").
  4. Ignore words assigned the POS labels specified by delete_pos/1 (currently, words assigned the POS label "." are ignored).
  5. Stem the word (the actual processing is done by "grammar/stem.lil"; the stem database used for this process is "DATA/Enju.dict").
  6. Assign the input word/POS to the INPUT/INPUT_POS features, the escaped form of the input word/POS to the SURFACE/POS features, and the stem of the input word/POS to the BASE/BASE_POS features. The position of the input word is assigned to the POSITION feature. The feature structure containing these features is defined in "mayz/word.lil".

When multiple POSs are assigned to the same word, they are separated by "|". For each POS, a feature structure of the 'word' type is created. Every word is therefore represented by a list of feature structures of the 'word' type.

Lists of 'word's created this way are assigned to the 'word' feature of feature structures of the 'extent_word' type. The 'extent_word' type is a subtype of the 'extent' type, from which it inherits the features 'left_pos' and 'right_pos'. These features hold the beginning position and the ending position of the 'extent'. Below is a feature structure of the 'extent' type:

  extent
    LEFT_POS:
    RIGHT_POS:

and a feature structure of the 'extent_word' type:

  extent_word
    WORD: < >
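
The per-word processing described above can be sketched in Python as follows. This is a rough illustration under several assumptions, not the code in "grammar/grammar.lil": the names Word, ExtentWord, tokens_to_extents, no_escape and no_stem are hypothetical, the POS part is assumed to follow the last "/" of each token, positions are assumed to be word indices (so each word spans from its index to the next), and the real escaping and stemming steps are replaced by placeholders that return their input unchanged.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

DELETE_POS = {"."}  # mirrors the current delete_pos/1 setting described above


@dataclass
class Word:
    input: str
    input_pos: str
    surface: str
    pos: str
    base: str
    base_pos: str
    position: int


@dataclass
class ExtentWord:
    left_pos: int
    right_pos: int
    word: List[Word]


def no_escape(text: str) -> str:
    """Placeholder for the escaping done by grammar/preproc.lil (assumption)."""
    return text


def no_stem(surface: str, pos: str) -> Tuple[str, str]:
    """Placeholder for the stemming done by grammar/stem.lil (assumption)."""
    return surface, pos


def tokens_to_extents(
    tokens: List[str],
    escape: Callable[[str], str] = no_escape,
    stem: Callable[[str, str], Tuple[str, str]] = no_stem,
) -> List[ExtentWord]:
    extents = []
    for position, token in enumerate(tokens):
        # Steps 1-2: split at the last "/" and use the word index as position.
        word, _, pos_field = token.rpartition("/")
        words = []
        for input_pos in pos_field.split("|"):  # one 'word' per POS
            # Step 3: escape the word and the POS label.
            surface, pos = escape(word), escape(input_pos)
            # Step 4: skip ignored POS labels (checked after escaping here,
            # which is an assumption).
            if pos in DELETE_POS:
                continue
            # Step 5: stemming.
            base, base_pos = stem(surface, pos)
            # Step 6: fill in the features of the 'word' structure.
            words.append(
                Word(word, input_pos, surface, pos, base, base_pos, position)
            )
        if words:
            extents.append(ExtentWord(position, position + 1, words))
    return extents
```

For example, calling tokens_to_extents(["Time/NN|VB", "flies/VBZ", "./."]) yields two ExtentWord objects: the first holds two Word entries (one per POS), and the final "./." token is dropped.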

token_to_word_lattice also handles brackets. If a tagged sentence contains a "\(" and a "\)", a feature structure of the 'extent_bracket' type corresponding to the bracketed fragment is created. This prevents up from generating constituents that cross brackets and increases the speed of parsing. In other words, if the tagger assigns some syntactic structure to the input sentence by bracketing fragments of it, the parser's performance can be improved.
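
The idea of a bracket constraint can be illustrated with the Python sketch below. It is not the parser's implementation: the function names bracket_spans and crosses are hypothetical, bracket tokens are assumed not to occupy a word position, and unmatched closing brackets are simply ignored.

```python
from typing import List, Tuple

OPEN, CLOSE = r"\(", r"\)"

Span = Tuple[int, int]


def bracket_spans(tokens: List[str]) -> List[Span]:
    """Return (left, right) word positions for each matched bracket pair."""
    spans = []
    stack = []  # start positions of brackets that are still open
    position = 0
    for token in tokens:
        if token == OPEN:
            stack.append(position)
        elif token == CLOSE:
            if stack:  # ignore an unmatched closing bracket (assumption)
                spans.append((stack.pop(), position))
        else:
            position += 1  # only ordinary words advance the position
    return spans


def crosses(span: Span, bracket: Span) -> bool:
    """True if the two spans overlap without one containing the other."""
    left, right = span
    b_left, b_right = bracket
    return left < b_left < right < b_right or b_left < left < b_right < right
```

With the tokens "The/DT \( big/JJ dog/NN \) barks/VBZ", bracket_spans returns one span covering positions 1 to 3. A candidate constituent covering positions 0 to 2 crosses that span and could be rejected, while one covering positions 0 to 4 contains it and is allowed.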


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)