About Enju

Enju is an accurate natural language parser for English. With a wide-coverage probabilistic HPSG grammar [1-7] and an efficient parsing algorithm [8-11], this parser can effectively analyze the syntactic and semantic structures of English sentences and provide users with phrase structures and predicate-argument structures. These outputs are especially useful for high-level NLP applications, including information extraction, automatic summarization, question answering, and machine translation, where the "meaning" of a sentence plays a central role.

This repository also includes the code for the Japanese CCG parser [19-21] and the Chinese HPSG parser [17-18]. The Japanese CCG parser is available as Jigg.

The main features of the Enju parser are:

Accurate deep analysis: the parser can output both phrase structures and predicate-argument structures. The accuracy of predicate-argument relations is around 90% for newswire articles and biomedical papers.
High speed: parsing takes less than 500 ms per sentence by default (faster than most Penn Treebank parsers), and less than 50 ms per sentence with the high-speed setting (mogura).
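
To get a rough sense of what these per-sentence figures mean for a whole corpus, the arithmetic can be sketched as below. This is only a back-of-the-envelope estimate: it treats the quoted figures as upper bounds on the time per sentence, and the function name and the 10,000-sentence corpus size are hypothetical, not part of Enju. Actual speed depends on hardware and sentence length.

```python
# Per-sentence upper bounds quoted in the feature list above (milliseconds).
DEFAULT_MS_PER_SENTENCE = 500   # default setting
MOGURA_MS_PER_SENTENCE = 50     # high-speed setting (mogura)


def parse_time_upper_bound_seconds(num_sentences, ms_per_sentence):
    """Upper bound on total parsing time, assuming every sentence
    takes at most `ms_per_sentence` milliseconds."""
    return num_sentences * ms_per_sentence / 1000.0


if __name__ == "__main__":
    corpus_size = 10_000  # hypothetical corpus size, for illustration only
    settings = [
        ("default", DEFAULT_MS_PER_SENTENCE),
        ("mogura", MOGURA_MS_PER_SENTENCE),
    ]
    for label, ms in settings:
        seconds = parse_time_upper_bound_seconds(corpus_size, ms)
        print(f"{label}: under {seconds:.0f} s (about {seconds / 60:.1f} min)")
```

Under these assumptions, 10,000 sentences take less than 5,000 seconds (about 83 minutes) with the default setting and less than 500 seconds (about 8 minutes) with mogura.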

Other useful features are:

Output parse results in an XML format: specify the option -xml. The parser adds XML tags to the original text, which is useful when parse results are merged with other processing results (e.g. named entities). A stand-off format is also available (specify -so).
Use a parsing model for the biomedical domain: specify the option -genia.
Use a parsing model for the literature domain: specify the option -brown.
Use a supertagger: run mogura -super
Convert Enju XML output into Penn Treebank-style output [15,16]: run enju2ptb/convert < ENJU_XML_OUTPUT > PTB_STYLE_OUTPUT
Let a POS tagger output ambiguous POS tags: specify the option -A. Parsing accuracy improves, but parsing becomes slower.
Output n-best parse results: specify the option -N. This feature is experimental and slows down parsing.
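
The options above can be combined on one command line. The following is a minimal sketch of a helper that assembles such command lines without running anything. It rests on several assumptions that this section does not confirm: that the parser executable is called enju, that it reads text from standard input and writes results to standard output (as the enju2ptb/convert example does), and that a format flag and a domain flag can be given together. The helper functions and file names are hypothetical. The -N option is left out because its exact usage is not described here.

```python
import shlex

# Flags taken from the feature list above; the dictionary keys are hypothetical names.
FORMAT_FLAGS = {"xml": "-xml", "standoff": "-so"}
MODEL_FLAGS = {"genia": "-genia", "brown": "-brown"}


def build_parser_command(executable="enju", output_format=None,
                         model=None, ambiguous_pos=False):
    """Return an argv list for one parser run (nothing is executed).

    `executable` defaults to "enju", which is an assumed binary name.
    """
    argv = [executable]
    if output_format is not None:
        argv.append(FORMAT_FLAGS[output_format])
    if model is not None:
        argv.append(MODEL_FLAGS[model])
    if ambiguous_pos:
        argv.append("-A")
    return argv


def build_ptb_pipeline(input_path, xml_path, ptb_path, model=None):
    """Return two shell command strings: parse to XML, then convert
    the XML into Penn Treebank-style output with enju2ptb/convert."""
    parse_argv = build_parser_command(output_format="xml", model=model)
    parse_cmd = (f"{shlex.join(parse_argv)} "
                 f"< {shlex.quote(input_path)} > {shlex.quote(xml_path)}")
    convert_cmd = (f"enju2ptb/convert "
                   f"< {shlex.quote(xml_path)} > {shlex.quote(ptb_path)}")
    return [parse_cmd, convert_cmd]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    for command in build_ptb_pipeline("abstract.txt", "abstract.xml",
                                      "abstract.ptb", model="genia"):
        print(command)
```

Printing the commands instead of executing them makes it easy to check the assembled options against the installed version before running a long parsing job.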