Ontological basis

This section briefly presents the ontological foundations of the annotation and the relationship of the annotated types to the relevant ontological categories.

The Basic Formal Ontology (BFO) [1] is used as the foundation for organizing the span types defined for the annotation. BFO is a small top-level ontology that has been broadly adopted, in particular in recent efforts to develop ontologies for science.

Additionally, for types relating specifically to information entities, we base our definitions on the Information Artifact Ontology (IAO) [2], originally a part of the Ontology for Biomedical Investigations (OBI) [3, 4]. OBI and IAO are explicitly based on BFO and thus compatible with its definitions.

Top-level organization

BFO uses entity as its top-level term and divides it at the uppermost level into continuants such as physical objects and occurrents such as processes (relations are understood to constitute a category distinct from entities). The definition of the annotated span types follows this basic division, as well as much of the upper-level organization of BFO, with a number of pragmatically motivated differences. Most obviously, the annotation avoids the technical BFO terms in favor of less precise but more readily understood labels, such as time instead of temporal region. Part of the top-level structure of BFO is shown below, together with the related types applied in the annotation.

(Indentation represents is-a relations; an ellipsis (…) signifies omission of intervening structure. Only limited depth and detail relevant to the present effort are shown, e.g. omitting spatiotemporal region_BFO, which is not used in the annotation.)

Reference ontology                Annotation
occurrent                         -
    processual entity             process
    temporal region               time
continuant                        -
    spatial region                location
    independent continuant        -
        object                    artifact, person
    dependent continuant          -
        ... quality               quality

The definitions of the relevant BFO terms are provided below for reference. Note that the (broadly) corresponding types applied in the annotation differ from these definitions in some cases.

For further detail on BFO, refer to the literature on the topic [1]; for detailed definitions of the annotated types, refer to Span annotation.
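The correspondence between BFO terms and annotation types described in this section can be sketched as a small lookup structure. The following is only an illustration, not part of the annotation tooling: the names BFO_PARENT, BFO_TO_ANNOTATION, ancestors, is_a and annotation_types_under are hypothetical, and quality is attached directly to dependent continuant even though the table elides intervening structure.

```python
# Part of the BFO is-a structure, stored as child -> parent.
# Assumption: "quality" is placed directly under "dependent continuant";
# the table marks omitted intervening structure with an ellipsis.
BFO_PARENT = {
    "occurrent": "entity",
    "processual entity": "occurrent",
    "temporal region": "occurrent",
    "continuant": "entity",
    "spatial region": "continuant",
    "independent continuant": "continuant",
    "object": "independent continuant",
    "dependent continuant": "continuant",
    "quality": "dependent continuant",
}

# Annotation types that broadly correspond to each BFO term.
# Terms without a corresponding type ("-" in the table) are left out.
BFO_TO_ANNOTATION = {
    "processual entity": ["process"],
    "temporal region": ["time"],
    "spatial region": ["location"],
    "object": ["artifact", "person"],
    "quality": ["quality"],
}


def ancestors(term):
    """Return the is-a chain from term up to the root, nearest first."""
    chain = []
    while term in BFO_PARENT:
        term = BFO_PARENT[term]
        chain.append(term)
    return chain


def is_a(term, ancestor):
    """True if term equals ancestor or is one of its descendants."""
    return term == ancestor or ancestor in ancestors(term)


def annotation_types_under(bfo_term):
    """Collect annotation types for bfo_term and all of its descendants."""
    return sorted(
        t
        for term, types in BFO_TO_ANNOTATION.items()
        if is_a(term, bfo_term)
        for t in types
    )


print(annotation_types_under("occurrent"))   # ['process', 'time']
print(annotation_types_under("continuant"))  # ['artifact', 'location', 'person', 'quality']
```

Storing the hierarchy as child-to-parent links keeps the is-a check a simple upward walk, which is sufficient for a structure of this size.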

Information entities

Information entities such as digital data are highly relevant in the target domain of the annotation effort. The annotated types corresponding to these entities are defined with reference to IAO information content entity, which is-a generically dependent continuant_BFO (see Top-level organization). Part of the relevant structure of IAO is shown below.

(As above, indentation represents is-a relations, an ellipsis (…) signifies omission of intervening structure, and only limited depth and detail relevant to the present effort are shown.)

Reference ontology                      Annotation
... dependent continuant                -
    information content entity          -
        data item                       data-item
        textual entity                  data-item
        directive information entity    plan

The definitions of the relevant IAO terms are provided below for reference.

The annotation simplifies substantially over the IAO definitions on two points. First, no distinction is made between data item, textual entity, and related terms such as document, document part, and symbol; all are typed data-item. Second, no distinction is made between subtypes of directive information entity, so that e.g. objective specification, plan specification, source code module and data format specification are all marked as plan.
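The two simplifications above amount to collapsing several IAO terms onto a single annotation type each. A minimal sketch follows, listing only the IAO terms named in this section; the names DATA_ITEM_TERMS, PLAN_TERMS and annotation_type_for_iao are hypothetical, and returning None for unlisted terms is an assumption made for the example.

```python
# IAO terms that the annotation does not distinguish: all are typed data-item.
DATA_ITEM_TERMS = {
    "data item",
    "textual entity",
    "document",
    "document part",
    "symbol",
}

# Directive information entity and the subtypes named above: all are typed plan.
PLAN_TERMS = {
    "directive information entity",
    "objective specification",
    "plan specification",
    "source code module",
    "data format specification",
}


def annotation_type_for_iao(term):
    """Map an IAO term to its annotation type.

    Returns None for terms not covered by this sketch (an assumption;
    the guidelines do not prescribe this behaviour).
    """
    if term in DATA_ITEM_TERMS:
        return "data-item"
    if term in PLAN_TERMS:
        return "plan"
    return None


print(annotation_type_for_iao("document part"))       # data-item
print(annotation_type_for_iao("source code module"))  # plan
```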

Other annotated types

In addition to the annotated entity types corresponding to upper-level ontological categories (Ontological basis) and information entities (Information entities), a small number of other annotation types are defined to capture specific phenomena in text. These are briefly presented below.

References: The types reference and external-reference are defined to capture two distinct but related categories of references to text, the former within the document (e.g. anaphoric it) and the latter to other documents (e.g. citations of other articles). These types are defined only to support the representation of factual claims in annotated documents and are not aligned with any ontological categories.

QUANTITY: The applied top-level ontology does not aim to account for numbers. The annotation type quantity is defined to capture numbers and measurements.

MODALITY: Explicit statements regarding belief, probability and similar, including explicit negation (e.g. not), are annotated as modality. This type is not aligned with any ontological categories.

Rare types: The annotation aims to associate the great majority of all entities that the authors explicitly mention with a relevant type, but does not aim for exhaustive coverage of all possible entities (see Ontological basis). Mentions of entities that do not fall within any of the defined categories are annotated with the “empty” type other. Specific types capturing rare entities that are particularly relevant to a domain are defined in a data-driven fashion. At the time of this writing, the following types have been defined following this process: language, organization, domain, and formula. Refer to the sections on these types for further information.

Ambiguous types

In addition to basic types broadly corresponding to ontological categories, a few types explicitly representing ambiguities between the basic types are defined.

The division between continuants and occurrents is fundamental to the ontological basis of the annotated types. However, in natural language the division between the two is not always clear, and, for many practical uses of annotation, the distinction is not systematically required. Consider, for example, the expression web search: depending on context, this expression could refer to any of at least the following: a specific process, a function (e.g. of a particular software) that could be realized as such a process, a goal to be achieved, or a set of steps for achieving it. Such ambiguities persist in statements such as web search is inefficient and method M improves web search performance. However, resolving this ambiguity is not required to identify, for example, that the latter statement expresses that method M has a positive effect on web search.

In the annotation, systematic ambiguity between expressions referring to processes, entities that can be realized as such processes, and information entities that indicate how to realize such processes is captured without resolution with the type plan-or-process, which is used in cases where this ambiguity occurs. In these cases, the annotation thus does not systematically differentiate between continuants and occurrents, but does preserve the closely related distinction between “static” entities that cannot take processual participants and (potentially) “dynamic” ones that can.

A second, considerably rarer systematic ambiguity occurs in some cases between references to people, computational methods emulating intelligent behaviour, and abstract, theoretical actors. For example, in documents on game theory, authors may refer to intelligent agents to intentionally abstract over human actors and others exhibiting (arguably) intelligent behaviour. The annotation type intelligent-agent is defined to annotate references where this ambiguity occurs.

Individuals, collectives, and universals

The distinction between individuals (e.g. Barack Obama), collections of individuals (e.g. members of congress) and universals (e.g. people) is fundamental in many formal ontological descriptions of reality. Individuals such as specific people are said to instantiate (stand in the instance-of relation to) universals such as the human species. Universals have multiple instances, while individuals have none. These categories do not overlap and are clearly distinguishable.

However, in natural language, references to individuals, universals and collectives are frequently ambiguous. While references to e.g. Barack Obama clearly identify an individual person, references to e.g. Microsoft Windows may, depending on context, refer to a specific individual copy of the software, some particular set of copies, or the universal constituted by all individual copies of the software.

For the purposes of this annotation effort, the distinction between individuals, collectives, and universals is ignored. Thus, for example, Barack Obama, congress members and humans are all annotated as person.