This section introduces the advanced usage of Enju.
By default, Enju uses the "stepp tagger" bundled in the package. To use a tagger that you developed yourself, specify the "-t" option when invoking enju.
% enju -t your_own_tagger
"your_own_tagger" must be an executable, and its path must be specified correctly. The input to the POS tagger is in the same format as the input to Enju (one sentence per line). The output must be in the following format.
John/NNP walked/VBD slowly/RB ./.
A word and its POS tag are joined by "/", and tokens are delimited by a single space. Enju will not work if more than one space appears between tokens.
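As a rough illustration of this output format, the following Python sketch implements a stand-in tagger that reads one sentence per line and writes word/POS tokens separated by exactly one space. The lookup table `TOY_LEXICON`, the fallback tag `NN`, and all function names are hypothetical and exist only for this example. It also assumes that the input tokens are already separated by whitespace, which a real tagger would have to handle itself.

```python
import sys

# Hypothetical toy lexicon, for illustration only; a real tagger would use a trained model.
TOY_LEXICON = {"John": "NNP", "walked": "VBD", "slowly": "RB", ".": "."}


def tag_sentence(sentence, default_tag="NN"):
    """Return (word, tag) pairs for a whitespace-tokenized sentence."""
    # str.split() with no argument discards runs of whitespace and the trailing newline.
    return [(word, TOY_LEXICON.get(word, default_tag)) for word in sentence.split()]


def format_tagged(pairs):
    """Join each word and tag with "/" and separate tokens by exactly one space."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Filter: one sentence per input line, one tagged sentence per output line."""
    for line in stdin:
        stdout.write(format_tagged(tag_sentence(line)) + "\n")
        # Flush so that a caller reading through a pipe sees each line promptly.
        stdout.flush()
```

Calling `main()` at the end of an executable script would turn this into a filter that could be named with the "-t" option. Building each line with `" ".join(...)` guarantees that no double spaces are produced.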
Alternatively, run your tagger in advance to tag your text, and pass the tagged text to enju. In this case, specify the "-nt" option.
% your_own_tagger < RAW_TEXT > TAGGED_TEXT
% enju -nt < TAGGED_TEXT
The format of TAGGED_TEXT is the same as above.
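Because Enju is strict about this format, it may be worth checking TAGGED_TEXT before passing it to "enju -nt". The sketch below is one possible check, not Enju's actual input parser. The function names are hypothetical, and three rules are assumptions of this example: the tag is taken to be whatever follows the last "/" in a token, tokens may not contain whitespace other than the single separating space, and empty lines are treated as invalid.

```python
def is_valid_tagged_line(line):
    """Check one line of tagged text against the word/POS format described above."""
    line = line.rstrip("\n")
    if not line:
        # Assumption of this sketch: an empty line is reported as invalid.
        return False
    # Splitting on a single space yields an empty token when two spaces are adjacent
    # or when the line starts or ends with a space.
    for token in line.split(" "):
        # Tabs or other whitespace inside a token are not single-space delimiters.
        if any(ch.isspace() for ch in token):
            return False
        # Assumption: the tag follows the LAST "/", so a word may itself contain "/".
        word, sep, tag = token.rpartition("/")
        if not (sep and word and tag):
            return False
    return True


def invalid_line_numbers(lines):
    """Return the 1-based numbers of the lines that fail the check."""
    return [n for n, line in enumerate(lines, start=1) if not is_valid_tagged_line(line)]
```

Passing an open file object for TAGGED_TEXT to `invalid_line_numbers` returns the lines to inspect; an empty list means that every line satisfies the rules listed above.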
Enju uses UP, an efficient parser for unification-based grammars. UP is included in the MAYZ toolkit. Several interfaces in UP give access to various information about parse results. For example, you can obtain HPSG signs, the time required for parsing, and the number of edges. By writing your own LiLFeS programs, you can produce your own parser output.
In fact, the default output formats of Enju (predicate-argument relations and XML outputs) are computed by LiLFeS programs. The source programs are provided in the package ("enju/grammar/{outputdep.lil,outputxml.lil}"). See these files for details. The HTTP/CGI server for parsing is also written in LiLFeS (see "enju/grammar/cgi.lil").
For details of UP, see the manual of UP.
The source package of Enju includes programs for building a grammar and probabilistic models from the Penn Treebank. By modifying these programs, users can improve or extend the grammar. Rebuilding the grammar and probabilistic models requires considerable computing power and time (around one day on a 2.2 GHz Xeon with 10 GB of memory).
The grammar-building programs use the MAYZ toolkit. See the manual of the toolkit for details. Amis 4.0 or above must also be installed.
As input resources, you need the ".mrg" files of Penn Treebank II (in which POS tags and tree structures are combined) and the WordNet data files (index.*, *.exc) for stemming. Put these files in "Corpus/". By default, the Makefile assumes "Corpus/02-21.trees" as the input for grammar construction. To use another input file, change the variable "TARGET_SECTION" defined at the beginning of "Makefile.am".
To rebuild the Enju grammar, specify --with-enju-grammar when you run "configure" for Enju.
./configure --with-enju-grammar
With this option, Makefile includes targets to make a grammar. Run make, and the grammar and probabilistic models will be rebuilt.
Additionally, add --with-genia-model when you want to retrain the parser with the GENIA treebank.
./configure --with-enju-grammar --with-genia-model
The GENIA treebank is assumed to be located at "Corpus/0001-1600-trainingAll". To use another file, change the variable "GENIA_TARGET_SECTION" in Makefile.am.
The Enju package includes some useful LiLFeS modules. Specify the name of a module with the option "-l" of the "enju" command (see How to use Enju).
"enju/outputdep.lil" is a LiLFeS module to output predicate-argument relations of words in a text format. When you run enju without arguments, this module will be loaded and the predicate output_dependency_file will automatically be executed.
In this module, the following predicates are available. See How to use Enju for the details of the output format.
output_dependency(+$Sentence, +$Stream)
$Sentence | Input sentence (string)
$Stream | Output stream (lilfes_stream)
Parse the input sentence $Sentence, and write the predicate-argument relations to the output stream $Stream. If parsing fails, the string "Parsing failure" is output.
output_dependency(+$Sentence)
$Sentence | Input sentence (string)
Parse the input sentence $Sentence, and write the predicate-argument relations to the standard output. If parsing fails, the string "Parsing failure" is output.
output_dependency_file(+$Input, +$Output)
$Input | Name of input file
$Output | Name of output file
Parse each line of the input file, and write the results to the output file.
output_dependency_file(+$Input)
$Input | Name of input file
Parse each line of the input file, and write the results to the standard output.
output_dependency_file
Parse each line of the standard input, and write the results to the standard output.
output_summary_file(+$Input, +$Output)
$Input | Name of input file
$Output | Name of output file
Parse each line of the input file, and write the results to the output file in a simple format.
output_summary_file(+$Input)
$Input | Name of input file
Parse each line of the input file, and write the results to the standard output in a simple format.
output_summary_file
Parse each line of the standard input, and write the results to the standard output in a simple format.
"enju/outputxml.lil" is a LiLFeS module to output parse results in the XML format. When you run enju with the "-xml" option, output_xml_file is executed, while output_so_file is executed with the "-so" option.
In this module, the following predicates are available. See How to use Enju for the details of the output format.
output_xml_file(+$Input, +$Output)
$Input | Name of input file
$Output | Name of output file
Parse each line of the input file $Input, and write the results to the output file $Output in the XML format.
output_xml_file(+$Input)
$Input | Name of input file
Parse each line of the input file $Input, and write the results to the standard output in the XML format.
output_xml_file
Parse each line of the standard input, and write the results to the standard output in the XML format.
output_so_file(+$Input, +$Output)
$Input | Name of input file
$Output | Name of output file
Parse each line of the input file $Input, and write the parse results to the output file $Output in the stand-off format.
output_so_file(+$Input)
$Input | Name of input file
Parse each line of the input file $Input, and write the parse results to the standard output in the stand-off format.
output_so_file
Parse each line of the standard input, and write the parse results to the standard output in the stand-off format.
"enju/moriv.lil" is a LiLFeS module for browsing parse results with MoriV or with a web browser that supports XHTML and XSLT (e.g., Firefox). You can browse parse trees and feature structures graphically.
When running Enju, specify the "-moriv" option to run the CGI server.
% enju -moriv port_number
Next, open "/cgi-lilfes/moriv?" at the port number you specified on the command line. For example, when you run Enju on "localhost", open the following URL (assuming the port number is 27109).
http://localhost:27109/cgi-lilfes/moriv?
Your browser shows a start page for browsing parse results. Enter a sentence in the form at the top of the page and press the "Parse" button. You will get an overview of the parse results (sentence length, the number of edges, parsing time, etc.) and a menu in the lower left. By clicking links in the menu, you can browse the parse results in various formats.
"Sign/Tree/Tree (with prob.)" shows the sign of the root node, the parse tree, and the parse tree with figure-of-merit (FOM). "Word lattice" shows the list of input words. "Node list" shows the nodes in the parse tree, and you can see the sign of each node. "Semantics" shows the predicate-argument dependencies of words. They are shown in a Prolog-like term representation, and the argument phrases are highlighted when you point at a predicate word with the mouse cursor.
You can browse other data of the parser and the grammar by clicking the links at the top of the page.
You can browse the following examples with web browsers such as Firefox and MoriV.