Advanced usage

This section introduces the advanced usage of Enju.


Use your own POS tagger

Enju uses the bundled "stepp tagger" by default. To use a tagger you developed yourself, specify the "-t" option when invoking enju.

% enju -t your_own_tagger

"your_own_tagger" must be executable, and its path must be specified correctly. The tagger receives its input in the same format as Enju's input (one sentence per line). Its output must be in the following format.

John/NNP walked/VBD slowly/RB ./.

A word and its POS are joined by "/", and tokens are delimited by a single space. Enju will not work if tokens are separated by more than one space.
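The output format described above can be sketched in Python as follows. This is only an illustration under stated assumptions: TOY_LEXICON, tag_sentence, and run_tagger are hypothetical names, the lexicon lookup stands in for a real tagging model, and the input is assumed to be already tokenized with whitespace.

```python
import sys

# Hypothetical toy lexicon, for illustration only; a real tagger would
# replace this lookup with its own model.
TOY_LEXICON = {"John": "NNP", "walked": "VBD", "slowly": "RB", ".": "."}


def tag_sentence(sentence, default_tag="NN"):
    """Return the sentence as word/POS tokens joined by single spaces."""
    # str.split() with no argument collapses runs of whitespace, so the
    # joined output never contains two consecutive spaces.
    words = sentence.split()
    return " ".join(f"{word}/{TOY_LEXICON.get(word, default_tag)}" for word in words)


def run_tagger(instream, outstream):
    """Tag one sentence per input line and write one tagged line for each."""
    for line in instream:
        outstream.write(tag_sentence(line) + "\n")
        outstream.flush()  # precaution (assumption): emit each line promptly
```

To use such a script as "your_own_tagger", it would need to be made executable and to end with a call such as run_tagger(sys.stdin, sys.stdout).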

Alternatively, run your tagger in advance to tag your text, and feed the tagged text to enju. In this case, specify the "-nt" option.

% your_own_tagger < RAW_TEXT > TAGGED_TEXT
% enju -nt < TAGGED_TEXT

The format of TAGGED_TEXT is the same as above.
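Because Enju does not accept tagged text with extra spaces, it may help to check TAGGED_TEXT before passing it to enju. The following Python sketch is one possible check, not part of Enju. The function name check_tagged_line is hypothetical, and treating the last "/" in a token as the word/POS separator is an assumption.

```python
import re

# One token: a non-empty word, a slash, and a non-empty POS tag that
# contains no slash or whitespace. Assumption: if the word itself
# contains "/", the last slash separates the word from the POS.
TOKEN_RE = re.compile(r"^\S+/[^/\s]+$")


def check_tagged_line(line):
    """Return a list of problems found in one line of tagged text."""
    problems = []
    text = line.rstrip("\n")
    if "  " in text:
        problems.append("consecutive spaces")
    for token in text.split(" "):
        # Empty tokens come from consecutive spaces, reported above.
        if token and not TOKEN_RE.match(token):
            problems.append(f"bad token: {token!r}")
    return problems
```

An empty list means that the line consists of word/POS tokens delimited by single spaces.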


Access to parse results

Enju uses UP, an efficient parser for unification-based grammars. UP is included in the MAYZ toolkit. Several interfaces in UP provide access to various information about parse results. For example, you can obtain HPSG signs, the time required for parsing, and the number of edges. By writing your own LiLFeS programs, you can produce parsing output in your own format.

In fact, the default output formats of Enju (predicate-argument relations and XML outputs) are computed by LiLFeS programs. The source programs are provided in the package ("enju/grammar/{outputdep.lil,outputxml.lil}"). See these files for details. The HTTP/CGI server for parsing is also written in LiLFeS (see "enju/grammar/cgi.lil").

For details of UP, see the manual of UP.


Making grammar from scratch

The source package of Enju includes programs for making a grammar and probabilistic models from the Penn Treebank. By modifying the programs, users can improve or extend the grammar. Rebuilding the grammar and probabilistic models requires substantial computing resources and time (around one day on a 2.2 GHz Xeon with 10 GByte of memory).

The grammar-making programs rely on the MAYZ toolkit. See the manual of the toolkit for details. Amis 4.0 or later must also be installed.

As input resources, you need the ".mrg" files of Penn Treebank II (in which POS tags and tree structures are combined) and the WordNet data files (index.*, *.exc) for stemming. Put these files in "Corpus/". By default, the Makefile assumes "Corpus/02-21.trees" as the input for grammar construction. To use another input file, change the variable "TARGET_SECTION" defined at the beginning of "Makefile.am".
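Since rebuilding takes a long time, it can be worth checking the input files before starting. The following Python sketch looks for the default input file and the WordNet data files in "Corpus/". It is not part of the Enju package, the function name missing_corpus_inputs is hypothetical, and the check only tests whether matching files exist, not whether their contents are valid.

```python
from pathlib import Path


def missing_corpus_inputs(corpus_dir="Corpus", target="02-21.trees"):
    """Return the expected input files or patterns that were not found."""
    root = Path(corpus_dir)
    missing = []
    # Default input of grammar construction assumed by the Makefile.
    if not (root / target).is_file():
        missing.append(target)
    # WordNet data files used for stemming: index.* and *.exc
    for pattern in ("index.*", "*.exc"):
        if not any(root.glob(pattern)):
            missing.append(pattern)
    return missing
```

If you changed "TARGET_SECTION", pass the corresponding file name as the second argument.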

To rebuild the Enju grammar, specify --with-enju-grammar when you "configure" Enju.

./configure --with-enju-grammar

With this option, Makefile includes targets to make a grammar. Run make, and the grammar and probabilistic models will be rebuilt.

Additionally, add --with-genia-model when you want to retrain the parser with the GENIA treebank.

./configure --with-enju-grammar --with-genia-model

The GENIA treebank is assumed to be located at "Corpus/0001-1600-trainingAll". To use another file, change the variable "GENIA_TARGET_SECTION" in Makefile.am.


LiLFeS modules

The Enju package includes some useful LiLFeS modules. Specify the name of a module with the option "-l" of the "enju" command (see How to use Enju).


Output predicate-argument relations of words

"enju/outputdep.lil" is a LiLFeS module to output predicate-argument relations of words in a text format. When you run enju without arguments, this module will be loaded and the predicate output_dependency_file will automatically be executed.

In this module, the following predicates are available. See How to use Enju for the details of the output format.

output_dependency(+$Sentence, +$Stream)
$Sentence: Input sentence (string)
$Stream: Output stream (lilfes_stream)
Parse input sentence $Sentence, and output predicate-argument relations into output stream $Stream. If parsing fails, the string "Parsing failure" is output.

output_dependency(+$Sentence)
$Sentence: Input sentence (string)
Parse input sentence $Sentence, and output predicate-argument relations into the standard output. If parsing fails, the string "Parsing failure" is output.

output_dependency_file(+$Input, +$Output)
$Input: Name of input file
$Output: Name of output file
Parse each line of the input file, and output the results in the output file.

output_dependency_file(+$Input)
$Input: Name of input file
Parse each line of the input file, and output the results in the standard output.

output_dependency_file
Parse each line of the standard input, and output the results in the standard output.

output_summary_file(+$Input, +$Output)
$Input: Name of input file
$Output: Name of output file
Parse each line of the input file, and output the results in the output file in a simple format.

output_summary_file(+$Input)
$Input: Name of input file
Parse each line of the input file, and output the results in the standard output in a simple format.

output_summary_file
Parse each line of the standard input, and output the results in the standard output in a simple format.

Output parse results in XML

"enju/outputxml.lil" is a LiLFeS module to output parse results in the XML format. When you run enju with the "-xml" option, output_xml_file is executed, while with the "-so" option, output_so_file is executed.

In this module, the following predicates are available. See How to use Enju for the details of the output format.

output_xml_file(+$Input, +$Output)
$Input: Name of input file
$Output: Name of output file
Parse each line of the input file $Input, and output the results in the output file $Output in the XML format.

output_xml_file(+$Input)
$Input: Name of input file
Parse each line of the input file $Input, and output the results in the standard output in the XML format.

output_xml_file
Parse each line of the standard input, and output the results in the standard output in the XML format.

output_so_file(+$Input, +$Output)
$Input: Name of input file
$Output: Name of output file
Parse each line of the input file $Input, and output the parse results into the output file $Output in the stand-off format.

output_so_file(+$Input)
$Input: Name of input file
Parse each line of the input file $Input, and output the parse results into the standard output in the stand-off format.

output_so_file
Parse each line of the standard input, and output the parse results into the standard output in the stand-off format.
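The correspondence between command-line options and output formats described in this manual can be sketched as a small Python wrapper. This is an illustration, not part of Enju: the names FORMAT_OPTIONS, enju_command, and parse_text are hypothetical, and the sketch assumes that "enju" is on the PATH and that it reads sentences from the standard input, one per line, in each of these modes.

```python
import subprocess

# Options named in this manual. "dep" is a label chosen here for the
# default predicate-argument output (enju run without arguments).
FORMAT_OPTIONS = {"dep": [], "xml": ["-xml"], "so": ["-so"]}


def enju_command(output_format="dep"):
    """Return the enju command line for the requested output format."""
    if output_format not in FORMAT_OPTIONS:
        raise ValueError(f"unknown output format: {output_format!r}")
    return ["enju"] + FORMAT_OPTIONS[output_format]


def parse_text(text, output_format="dep"):
    """Run enju on text (one sentence per line) and return its output.

    Assumption: enju is installed, is on the PATH, and reads the
    sentences from the standard input.
    """
    result = subprocess.run(
        enju_command(output_format),
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, parse_text("John walked slowly.\n", "xml") would run "enju -xml" on a single sentence.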

Browsing parse results with GUI

"enju/moriv.lil" is a LiLFeS module for browsing parse results with MoriV or a web browser that supports XHTML and XSLT (e.g. Firefox). You can browse parse trees and feature structures graphically.

When running Enju, specify the "-moriv" option to run the CGI server.

% enju -moriv port_number

Next, access "/cgi-lilfes/moriv?" at the port number you specified on the command line. For example, when you run Enju on "localhost", access the following URL (assuming the port number is 27109).

http://localhost:27109/cgi-lilfes/moriv?
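If you start the server with different hosts or port numbers, the URL can be assembled programmatically. The following Python sketch builds the URL shown above and opens it in the default web browser. The function names moriv_url and open_moriv are hypothetical, and the sketch assumes that the server is already running.

```python
import webbrowser


def moriv_url(host, port):
    """Return the MoriV start page URL for a running Enju CGI server."""
    return f"http://{host}:{port}/cgi-lilfes/moriv?"


def open_moriv(host="localhost", port=27109):
    """Open the start page in the default web browser.

    Assumption: "enju -moriv port_number" was started beforehand with
    the same port number.
    """
    return webbrowser.open(moriv_url(host, port))
```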

Your browser shows a start page for browsing parse results. Enter a sentence in the form at the top of the page and press the "Parse" button. You will get an overview of the parse results (sentence length, the number of edges, parsing time, etc.) and a menu at the lower left. By clicking links in the menu, you can browse the parse results in various formats.

"Sign/Tree/Tree (with prob.)" shows the sign of the root node, the parse tree, and the parse tree with figure-of-merit (FOM) values. "Word lattice" shows the list of input words. "Node list" shows the nodes in the parse tree, and you can see the sign of each node. "Semantics" shows predicate-argument dependencies of words. They are shown in a Prolog-like term representation, and the argument phrases of a predicate are highlighted when you point at the predicate word with the mouse cursor.

You can browse other data of the parser and the grammar by clicking the links at the top of the page.

Parser
As described above, a page for browsing parser outputs is shown.
Chart
A chart for parsing is shown. You can browse the edges stored in each cell of the chart. When you enter a sentence, a chart is shown at the lower left of the page. Each cell shows the number of edges it contains. Clicking the number shows a list of the edges in the cell on the right of the page. The list shows the symbols of edge signs (VP, NP, etc.), FOMs, the ID numbers of the daughters, etc. By clicking links, you can see the sign of an edge.
Grammar
A page for browsing the lexical signs assigned to words is shown. When you enter a word and a POS joined by a slash (e.g. "likes/VBZ") in the first form, a list of lexical entries is shown at the lower left of the page. The numbers show the FOM of each lexical entry. When you click a link, the right frame shows the sign of the lexical entry. When you enter the name of a lexeme in the second form, the sign of the lexeme is shown in the right frame. "List of all templates" shows a list of all lexical entry templates. If the number of templates is large, this might take a while.
Console
Show a LiLFeS console in a new page.
Reset
Return to the top page.
Manual
Show the manual of Enju in a new page.
Enju Home Page
Show Enju Home Page in a new page.
Exit
Quit browsing.

You can browse the following examples with web browsers such as Firefox and MoriV.


MIYAO Yusuke (yusuke@is.s.u-tokyo.ac.jp)