Ontological basis
This section briefly presents the ontological foundations of the annotation and the relationship of the annotated types to the relevant ontological categories.
The Basic Formal Ontology (BFO) [1] is used as the foundation for organizing the span types defined for the annotation. BFO is a small top-level ontology that has been broadly adopted in particular in recent efforts to develop ontologies for science.
Additionally, for various types of entities relating specifically to information entities, we base our definitions on the Information Artifact Ontology (IAO) [2], originally a part of the Ontology for Biomedical Investigations (OBI) [3, 4]. OBI and IAO are explicitly based on BFO and thus compatible with its definitions.
Top-level organization
BFO uses entity as its top-level term and divides it at the uppermost level into continuants such as physical objects and occurrents such as processes (relations are understood to constitute a category distinct from entities). In the definition of the annotated span types, this basic division as well as much of the upper-level organization of BFO is followed, with a number of pragmatically motivated differences. Most obviously, the annotation avoids the technical BFO terms in favor of less precise but more readily understood labels, such as TIME instead of temporal region. Part of the top-level structure of BFO is shown in the following with the relevant related types applied in the annotation.
(Indentation represents is-a relations; an ellipsis (…) signifies omission of intervening structure. Only limited depth and detail relevant to the present effort is shown, e.g. omitting the BFO term spatiotemporal region, which is not used in the annotation.)
Reference ontology | Annotation |
---|---|
occurrent | - |
&nbsp;&nbsp;processual entity | process |
&nbsp;&nbsp;temporal region | time |
continuant | - |
&nbsp;&nbsp;spatial region | location |
&nbsp;&nbsp;independent continuant | - |
&nbsp;&nbsp;&nbsp;&nbsp;object | artifact, person |
&nbsp;&nbsp;dependent continuant | - |
&nbsp;&nbsp;&nbsp;&nbsp;... quality | quality |
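The correspondence in the table above can also be written down as a small lookup structure. The following is a minimal sketch rather than part of any annotation tooling: the names `BFO_PARENT`, `BFO_TO_ANNOTATION`, `ancestors` and `annotation_types` are hypothetical, and the is-a links are read off the BFO definitions quoted in this section, with intervening structure omitted exactly as in the table.

```python
# Illustrative sketch only: all names here are hypothetical.

# is-a links between the BFO terms shown in the table (child -> parent).
# Intervening BFO terms that the table omits are omitted here as well.
BFO_PARENT = {
    "occurrent": "entity",
    "processual entity": "occurrent",
    "temporal region": "occurrent",
    "continuant": "entity",
    "spatial region": "continuant",
    "independent continuant": "continuant",
    "object": "independent continuant",
    "dependent continuant": "continuant",
    "quality": "dependent continuant",
}

# Broadly corresponding annotation types; terms marked "-" in the table are absent.
BFO_TO_ANNOTATION = {
    "processual entity": ["process"],
    "temporal region": ["time"],
    "spatial region": ["location"],
    "object": ["artifact", "person"],
    "quality": ["quality"],
}


def ancestors(term):
    """Return the chain of is-a ancestors of a BFO term, nearest first."""
    chain = []
    while term in BFO_PARENT:
        term = BFO_PARENT[term]
        chain.append(term)
    return chain


def annotation_types(term):
    """Return the annotation types listed for a BFO term (empty list if none)."""
    return BFO_TO_ANNOTATION.get(term, [])


print(ancestors("object"))         # ['independent continuant', 'continuant', 'entity']
print(annotation_types("object"))  # ['artifact', 'person']
```

Keeping the hierarchy and the type correspondence in separate dictionaries mirrors the two columns of the table: the first records what BFO states, the second what the annotation chooses to use.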
The definitions of the relevant BFO terms are provided below for reference. Note that the (broadly) corresponding types applied in the annotation differ from these definitions in some cases.
- occurrent: An entity that has temporal parts and that happens, unfolds or develops through time.
- processual entity: An occurrent that exists in time by occurring or happening, has temporal parts and always involves and depends on some entity.
- temporal region: An occurrent that is part of time.
- continuant: An entity that exists in full at any time in which it exists at all, persists through time while maintaining its identity and has no temporal parts.
- spatial region: A continuant that is neither bearer of quality entities nor inheres in any other entities. (Examples: the sum total of all space in the universe, parts of the sum total of all space in the universe)
- independent continuant: A continuant that is a bearer of quality and realizable entity entities, in which other entities inhere and which itself cannot inhere in anything.
- object: An independent continuant that is spatially extended, maximally self-connected and self-contained […] and possesses an internal unity. The identity of substantial object entities is independent of that of other entities and can be maintained through time.
- dependent continuant: A continuant that is either dependent on one or other independent continuant bearers or inheres in or is borne by other entities.
- quality: A specifically dependent continuant that is exhibited if it inheres in an entity or entities at all (a categorical property).
For further detail on BFO, refer to the literature on the topic [1]; for detailed definitions of the annotated types, refer to Span annotation.
Information entities
Information entities such as digital data are highly relevant in the target domain of the annotation effort. The annotated types corresponding to these entities are defined with reference to the IAO term information content entity, which is-a BFO generically dependent continuant (see Top-level organization). Part of the relevant structure of IAO is shown in the following.
(As above, indentation represents is-a relations, an ellipsis (…) signifies omission of intervening structure, and only limited depth and detail relevant to the present effort is shown.)
Reference ontology | Annotation |
---|---|
... dependent continuant | - |
&nbsp;&nbsp;information content entity | - |
&nbsp;&nbsp;&nbsp;&nbsp;data item | data-item |
&nbsp;&nbsp;&nbsp;&nbsp;textual entity | data-item |
&nbsp;&nbsp;&nbsp;&nbsp;directive information entity | plan |
The definitions of the relevant IAO terms are provided below for reference.
- information content entity: an information content entity is an entity that is generically dependent on some artifact and stands in relation of aboutness to some entity. [IAO presently takes aboutness (is-about) to be a primitive relation with no definition beyond “relates an information artifact to an entity.”]
- data item: a data item is an information content entity that is intended to be a truthful statement about something (modulo, e.g., measurement precision or other systematic errors) and is constructed/acquired by a method which reliably tends to produce (approximately) truthful statements.
- textual entity: A textual entity is a part of a manifestation […], a generically dependent continuant whose concretizations are patterns of glyphs intended to be interpreted as words, formulas, etc. (Examples: Words, sentences, paragraphs, and the written (non-figure) parts of publications are all textual entities)
- directive information entity: An information content entity whose concretizations indicate to their bearer how to realize them in a process.
The annotation simplifies substantially over the IAO definitions on two points. First, no distinction is made between data item, textual entity, or related terms such as document, document part, and symbol; all are typed data-item. Second, no distinction is made between subtypes of directive information entity; e.g. objective specification, plan specification, source code module and data format specification are all marked as plan.
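The two simplifications can be illustrated as a collapse of IAO terms onto annotation types. This is a minimal sketch under stated assumptions: the names `DATA_ITEM_TERMS`, `PLAN_TERMS` and `collapse_iao_term` are hypothetical, and the term sets contain only the terms named in the text above, which are given there as examples and are not an exhaustive list.

```python
# Illustrative sketch only: all names here are hypothetical, and the
# term sets cover just the examples named in the text, not all of IAO.

# IAO terms that the annotation does not distinguish; all typed data-item.
DATA_ITEM_TERMS = {
    "data item",
    "textual entity",
    "document",
    "document part",
    "symbol",
}

# Directive information entity and the subtypes named in the text; all typed plan.
PLAN_TERMS = {
    "directive information entity",
    "objective specification",
    "plan specification",
    "source code module",
    "data format specification",
}


def collapse_iao_term(term):
    """Map an IAO term to its simplified annotation type, or None if not listed."""
    if term in DATA_ITEM_TERMS:
        return "data-item"
    if term in PLAN_TERMS:
        return "plan"
    return None


print(collapse_iao_term("document part"))       # data-item
print(collapse_iao_term("source code module"))  # plan
```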
Other annotated types
In addition to the annotated entity types corresponding to upper-level ontological categories (Ontological basis) and information entities (Information entities), a small number of other annotation types are defined to capture specific phenomena in text. These are briefly presented below.
References: The types reference and external-reference are defined to capture two distinct but related categories of references to text, the former within the document (e.g. anaphoric “it”) and the latter to other documents (e.g. citations of other articles). These types are defined only to support the representation of factual claims in annotated documents and are not aligned with any ontological categories.
QUANTITY: The applied top-level ontology does not aim to account for numbers. The annotation type quantity is defined to capture numbers and measurements.
MODALITY: Explicit statements regarding belief, probability and similar, including explicitly negated statements (e.g. “not”), are annotated as modality. This type is not aligned with any ontological categories.
Rare types: The annotation aims to associate the great majority of all entities that the authors explicitly mention with a relevant type, but does not aim for exhaustive coverage of all possible entities (see Ontological basis). Mentions of entities that do not fall within any of the defined categories are annotated with the “empty” type other. Specific types capturing rare entities that are particularly relevant to a domain are defined in a data-driven fashion. At the time of this writing, the following types have been defined following this process: language, organization, domain, and formula. Refer to these sections for information on these types.
Ambiguous types
In addition to basic types broadly corresponding to ontological categories, a few types explicitly representing ambiguities between the basic types are defined.
The division between continuants and occurrents is fundamental to the ontological basis of the annotated types. However, in natural language the division between the two is not always clear, and, for many practical uses of annotation, the distinction is not systematically required. Consider, for example, the expression “web search”: depending on context, this expression could refer to at least any of the following: a specific process, a function (e.g. of a particular software) that could be realized as such a process, a goal to be achieved, or a set of steps for doing so. Such ambiguities persist in statements such as “web search is inefficient” and “method M improves web search performance”. However, resolution of this ambiguity is not required to identify e.g. that the latter statement expresses that “method M” has a positive effect on “web search”.
In the annotation, systematic ambiguity between expressions referring to processes, entities that can be realized as such processes, and information entities that indicate how to realize such processes is captured without resolution with the type plan-or-process, which is used in cases where this ambiguity occurs. In these cases, the annotation thus does not systematically differentiate between continuants and occurrents, but does preserve the closely related distinction between “static” entities that cannot take processual participants and (potentially) “dynamic” ones that can.
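The decision described above can be sketched as a function from the plausible readings of an expression to an annotation type. This is only an illustration under stated assumptions: the reading labels and the function name `type_for_readings` are hypothetical, and cases that the text does not address (for example an expression whose only reading is a realizable entity) are left undecided rather than guessed.

```python
# Illustrative sketch only: reading labels and function name are hypothetical.

# Readings between which the text describes a systematic ambiguity.
PROCESS = "process"               # a specific process
REALIZABLE = "realizable entity"  # e.g. a function that can be realized as a process
PLAN = "plan"                     # information entity indicating how to realize a process
AMBIGUOUS_READINGS = {PROCESS, REALIZABLE, PLAN}


def type_for_readings(readings):
    """Pick an annotation type from the plausible readings of an expression.

    Returns None for cases that this sketch does not cover.
    """
    readings = set(readings)
    if readings == {PROCESS}:
        return "process"
    if readings == {PLAN}:
        return "plan"
    if len(readings) > 1 and readings <= AMBIGUOUS_READINGS:
        # The ambiguity is recorded as such rather than resolved.
        return "plan-or-process"
    return None


print(type_for_readings({PROCESS, REALIZABLE, PLAN}))  # plan-or-process
print(type_for_readings({PROCESS}))                    # process
```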
A second, considerably rarer systematic ambiguity occurs in some cases between references to people, computational methods emulating intelligent behaviour, and abstract, theoretical actors. For example, in documents on game theory, authors may refer to “intelligent agents” to intentionally abstract over human actors and others exhibiting (arguably) intelligent behaviour. The annotation type intelligent-agent is defined to annotate references where this ambiguity occurs.
Individuals, collectives, and universals
The distinction between individuals (e.g. Barack Obama), collections of individuals (e.g. members of congress) and universals (e.g. people) is fundamental in many formal ontological descriptions of reality. Individuals such as specific people are said to instantiate (stand in the instance-of relation to) universals such as the human species. Universals have multiple instances, while individuals have none. These categories do not overlap and are clearly distinguishable.
However, in natural language, references to individuals, universals and collectives are frequently ambiguous. While references to e.g. “Barack Obama” clearly identify an individual person, references to e.g. “Microsoft Windows” may, depending on context, refer to a specific individual copy of the software, some particular set of copies, or the universal constituted by all individual copies of the software.
For the purposes of this annotation effort, the distinction between individuals, collectives, and universals is ignored. Thus, for example, “Barack Obama”, “congress members” and “humans” are all annotated as person.