Other works are also available at our GitHub repository.

Syntactic Parsing


Jigg is a framework for easily combining natural language processing tools for tasks such as chunking, dependency parsing, POS tagging, and semantic parsing. The tools can be used simply by downloading JAR archives.

Jigg also includes a Japanese syntactic parser based on Combinatory Categorial Grammar (CCG). This parser is used in ccg2lambda, software for recognizing textual entailment.


Corbit is an integrated text analyzer for Chinese, which performs word segmentation, part-of-speech (POS) tagging, and dependency parsing of Chinese text with state-of-the-art performance. Corbit is built on incremental, transition-based parsing algorithms, which make it possible to run each of these tasks individually, or any combination of them with joint decoding, very efficiently. Joint decoding usually yields higher accuracy, at the cost of slower processing as its complexity grows.
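
To illustrate what "transition-based" means in general, the following is a minimal sketch of an arc-standard transition system in Python. It is not Corbit's code or API: the function name `run_transitions`, the action names, and the toy English sentence are all assumptions made for illustration, and the action sequence is supplied by hand, whereas a real parser would choose each action with a learned model.

```python
from typing import List, Tuple


def run_transitions(words: List[str], actions: List[str]) -> List[Tuple[str, str]]:
    """Apply arc-standard actions and return (head, dependent) arcs.

    Hypothetical helper for illustration only; not part of Corbit.
    """
    stack: List[int] = []              # indices of partially processed words
    buffer = list(range(len(words)))   # indices of words not yet read
    arcs: List[Tuple[str, str]] = []

    for action in actions:
        if action == "SHIFT":
            # Move the next input word onto the stack.
            stack.append(buffer.pop(0))
        elif action == "LEFT":
            # The second-from-top word becomes a dependent of the top word.
            dep = stack.pop(-2)
            arcs.append((words[stack[-1]], words[dep]))
        elif action == "RIGHT":
            # The top word becomes a dependent of the second-from-top word.
            dep = stack.pop()
            arcs.append((words[stack[-1]], words[dep]))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


if __name__ == "__main__":
    sentence = ["She", "reads", "books"]
    sequence = ["SHIFT", "SHIFT", "LEFT", "SHIFT", "RIGHT"]
    for head, dependent in run_transitions(sentence, sequence):
        print(f"{head} -> {dependent}")
```

Because the analysis is built one action at a time, the same loop can in principle be extended with further action types, which is the general idea behind handling several tasks in one joint decoding pass.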


Enju is a deep syntactic parser for English based on HPSG theory. It analyses not only phrase structures and dependency structures but also detailed syntactic and semantic structures (predicate-argument structures), with high speed and accuracy.

Kaede Treebank

Kaede Treebank is a constituent-based treebank for Japanese. It provides phrase-structure annotations for part of the Kyoto Text Corpus. This treebank is useful for training a constituent parser for Japanese, and it has also been used to develop a CCG parser module for Jigg.

NIIVTB Vietnamese Treebank

NIIVTB is a constituent-based treebank for Vietnamese. It consists of around 20,000 sentences from news texts, annotated with word segmentation, POS tags, and bracketings.

Universal Dependencies

This is a project to develop multilingual treebanks in a universal format. We are engaged in the development of Japanese and Amharic treebanks.

Semantic Parsing


ccg2lambda is a system for recognizing textual entailment based on higher-order logic, using CCG syntactic parsing. It uses C&C Parser and EasyCCG Parser for English, and Jigg for Japanese. Entailment is recognized with a rule-based inference engine, and the system has achieved good results on various evaluation datasets.


TIFMO is a system for recognizing textual entailment relations in natural language texts. It accurately recognizes advanced logical inferences, including those involving universal quantifiers and negation, as well as the wide variety of paraphrases observed in real-world texts. TIFMO analyses the meanings of sentences using Dependency-based Compositional Semantics, and can combine them with various kinds of linguistic and world knowledge in fast logical inference.


We organized the NTCIR RITE tasks, which deal with the recognition of inference relations between texts, at NTCIR, an evaluation-style workshop. Recognizing inference relations is a technology for automatically determining whether two different texts are equivalent or different in meaning. We created evaluation data from texts extracted from Wikipedia and university entrance exams and provided it to the participating teams.


Automatic Video Description Generation

This is software for automatically generating natural-language descriptions of the visual content of videos, a task known as automatic video description generation. The model builds on the sequence-to-sequence model used in machine translation and assigns weights to the frames it focuses on. It has achieved high accuracy on more than one dataset.
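
As a rough illustration of frame weighting, the sketch below turns per-frame relevance scores into normalized weights with a softmax and uses them to average the frame feature vectors. This is a generic attention-style computation, not the released model: the names `softmax` and `attend`, the feature values, and the scores are assumptions, and in a trained model the scores would be computed by the network rather than given by hand.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert arbitrary scores into positive weights that sum to 1."""
    highest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame_features: List[List[float]], scores: List[float]) -> List[float]:
    """Return the weighted average of frame feature vectors.

    Hypothetical helper for illustration; frames with higher scores
    contribute more to the resulting context vector.
    """
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [
        sum(w * frame[i] for w, frame in zip(weights, frame_features))
        for i in range(dim)
    ]


if __name__ == "__main__":
    # Three toy frames with two-dimensional features (made-up values).
    frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    # Made-up relevance scores; the third frame is treated as most relevant.
    print(attend(frames, [0.1, 0.2, 2.0]))
```

A decoder would typically recompute such weights for each generated word, so that different words can draw on different parts of the video.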

Knowledge Discovery from Academic Papers


RANIS is a corpus annotated with semantic relationships in academic papers. It annotates terms in academic papers and the semantic relationships among them, such as “method”, “purpose”, and “result”. Data for English abstracts of ACM and ACL papers and for Japanese abstracts of IPSJ Journal papers are available. We have also released the annotation guidelines.

Question Answering

Artificial Intelligence Project

NII promotes the Artificial Intelligence Project, which aims to develop an integrated artificial intelligence capable of passing university entrance exams. University entrance exam questions are posed and answered in natural language, making them a prime application of natural language processing. When we analyze the process of understanding and answering questions (that is, thinking), however, we realize that it requires not only natural language processing but also a range of other artificial intelligence technologies: understanding and manipulation of mathematical formulas, domain knowledge, logical inference, and unified comprehension of verbal and nonverbal information (such as graphs and pictures). Through the development of an integrated system that solves university entrance exam problems, we intend to shed light on what can and cannot be done with orchestrated frontier AI technologies, as well as on what role natural language processing can play.


NIILC-QA is a dataset of question-answer pairs based on Wikipedia, with various kinds of additional information. It is intended to support the development of technologies that let a system explain how it arrived at its answer to a question. For this purpose, we have manually added information such as keywords and queries.

Dialogue Systems


We organized the NTCIR STC tasks at NTCIR, an evaluation-style workshop. The Short Text Conversation (STC) tasks deal with generating short conversations. We have constructed datasets with human judgements of the appropriateness of question-answer pairs.

Infrastructure Software


Amis is software for learning maximum entropy models based on the Feature Forest Model. It is used in the Enju project to train the models that resolve ambiguity.
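
For readers unfamiliar with maximum entropy models, the sketch below shows how such a model assigns probabilities to competing analyses: each candidate is scored by summing the weights of its features, and the scores are exponentiated and normalized. This is not Amis's interface. The function name `maxent_probs`, the feature names, and the weights are made up for illustration, and the sketch enumerates candidates explicitly rather than attempting the Feature Forest Model itself.

```python
import math
from typing import Dict, List


def maxent_probs(
    weights: Dict[str, float],
    candidates: Dict[str, List[str]],
) -> Dict[str, float]:
    """Return p(candidate) proportional to exp(sum of active feature weights).

    Hypothetical helper for illustration; features missing from
    `weights` are treated as having weight 0.
    """
    scores = {
        name: sum(weights.get(f, 0.0) for f in feats)
        for name, feats in candidates.items()
    }
    highest = max(scores.values())  # subtract the maximum for numerical stability
    exps = {name: math.exp(s - highest) for name, s in scores.items()}
    z = sum(exps.values())          # normalization constant
    return {name: e / z for name, e in exps.items()}


if __name__ == "__main__":
    # Made-up weights, as if produced by training.
    learned = {"verb_attach": 1.2, "noun_attach": 0.3, "short_dep": 0.5}
    # Two competing analyses of the same sentence, each with its active features.
    analyses = {
        "analysis_A": ["verb_attach", "short_dep"],
        "analysis_B": ["noun_attach"],
    }
    probs = maxent_probs(learned, analyses)
    print(max(probs, key=probs.get), probs)
```

Disambiguation then amounts to choosing the candidate with the highest probability.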


LiLFeS is a logic programming language with typed feature structures. Its feature structure processing can be called from C++, so it can also be used as a library. It is used to implement Enju.