Preprocessing is handled by the sentence_to_word_lattice/2 predicate found in the file "grammar/grammar.lil". Escaping of special characters is handled in "grammar/preproc.lil", and stemming is handled in "grammar/stem.lil". You can customize preprocessing as you like by changing "grammar/grammar.lil".
Preprocessing is done in the following order; each step is explained below.
In order to use a tagger, the external_tagger/2 predicate included in the MAYZ toolkit has to succeed. Use the initialize_external_tagger/2 predicate to start an external tagger before calling external_tagger/2. The first argument is passed to the external tagger as its input, and the second argument is bound to the output returned by the external tagger.
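For illustration, the following is a minimal sketch of a query to external_tagger/2. The sentence and the tagged result are examples only; the actual output depends on the external tagger that has been initialized.

    ?- external_tagger("The dog barks .", $Tagged).
    % $Tagged = "The/DT dog/NN barks/VBZ ./."   (example output only)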
The initialization of the tagger is done by initialize_external_tagger/2, which is found in the file "grammar/grammar.lil". If the environment variable "ENJU_TAGGER" is set, the tagger specified by its value is started. If "ENJU_TAGGER" is not set, an external tagger cannot be used.
The environment variable "ENJU_TAGGER" is set by the enju command ("parser/enju.cc") before initialization. By default, it is set to "uptagger". Users can change the value assigned to "ENJU_TAGGER", and the -t option of the enju command can also be used to specify the tagger.
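For example (a sketch of typical usage only), the tagger can be chosen either with the -t option or by setting the environment variable before running enju; the exact tagger command string depends on the installation.

    # specify the default tagger explicitly with the -t option
    enju -t uptagger

    # or set the environment variable before running enju
    ENJU_TAGGER=uptagger enju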
The output of the tagger will be stored in the array named '*enju_tagged_sentence*'.
The output of the tagger (the second argument of external_tagger/2) is assumed to be a string in which words are separated by the space character, and the output string is segmented into words at single space characters.
A single space is treated as a separator, while a sequence of multiple spaces is treated as the space character itself.
All words are now in "WORD/POS" format, where POS is a part-of-speech tag from the Penn Treebank tag set. The sentence_to_word_lattice/2 predicate turns them into a list of 'extent' feature structures (the 'extent' type is defined in mayz/parser.lil); this list is the second argument of sentence_to_word_lattice/2. To be more precise, it is the token_to_word_lattice clause invoked within sentence_to_word_lattice/2 that generates the actual list.
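The following is a minimal sketch of such a call. The first argument is assumed here to be the input sentence (the text above only specifies the second argument), and the tags are examples.

    ?- sentence_to_word_lattice("The dog barks .", $Lattice).
    % After tagging ("The/DT dog/NN barks/VBZ ./."), $Lattice is bound to a
    % list of 'extent' feature structures, one 'extent_word' per word.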
token_to_word_lattice processes each word in the following order:
When multiple POSs are assigned to the same word, the POSs are separated by "|". For each POS, a feature structure of the 'word' type is created, so each word is represented by a list of feature structures of the 'word' type.
Lists of 'word's created this way are assigned to the 'word' feature of feature structures of the 'extent_word' type. The 'extent_word' type is a subtype of the 'extent' type and inherits from it the features 'left_pos' and 'right_pos', which hold the beginning and the ending positions of the extent (an example feature structure is sketched after the hierarchy below). The place of 'extent_word' in the 'extent' type hierarchy is shown below:
    extent
      |
    extent_word
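As an example, the following sketch unifies a variable with an 'extent_word' for a word tagged ambiguously as "saw/VBD|NN" and spanning positions 1 to 2. The feature names follow the description above, and the contents of the 'word' feature structures are left as variables because the internal features of the 'word' type are grammar-specific.

    ?- $Extent = (extent_word &
                  left_pos\1 &
                  right_pos\2 &
                  word\[$W_VBD, $W_NN]).   % one 'word' per POS candidate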
token_to_word_lattice also handles brackets. If a tagged sentence contains "\(" and "\)", a feature structure of the 'extent_bracket' type corresponding to the bracketed fragment is created. This prevents the parser from generating constituents that cross brackets and increases parsing speed. In other words, if the tagger assigns some level of syntactic structure to the input sentence by bracketing fragments of it, the parser's performance can be improved.
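For instance, with a tagger output of the following form (example sentence only), an 'extent_bracket' covering "the/DT dog/NN" would be added to the lattice, so that no constituent crossing that bracket is built.

    % example tagged sentence containing escaped brackets:
    %   I/PRP saw/VBD \( the/DT dog/NN \) ./.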