Vocabulary Service Description and Assumptions

Description

A vocabulary is a collection of controlled terms. The terms may be unorganized, or arranged as an ordered list, a taxonomic tree, or a relational graph (ontology). Such collections are variously called (controlled) vocabularies, taxonomies, ontologies, and authorities, among other names. A Vocabulary has terms that carry individual information, including provenance and bibliographic references, and these terms are related in some structure.

Certain standard functionality is required by consumers of all vocabularies, and of services that must "act like" vocabularies. This latter group includes Location, Organization, and Person, among others. The model supported by this service can be thought of as an interface or abstract class definition (as in the Java language). The simplest implementing "class" (in practice, a service) supports "pure" vocabularies that are just enumerations, controlled term lists or concept hierarchies. Other implementing "classes" may extend the respective classes to support additional semantics. E.g.:

  • A simple zoological taxonomy may add the "rank" (Family, Genus, Species, etc.) to each VocabularyItem.
  • The Location service may add additional properties like latitude and longitude.
  • The Organization service may define types of relations to indicate different hierarchies: financial vs. political.
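
The "interface plus extending classes" idea can be sketched as follows. This is only an illustration of the shape of the model, not a proposed implementation; the class and field names (TaxonItem, LocationItem, item_id, and so on) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VocabularyItem:
    """Base item: the properties every vocabulary-like service shares."""
    item_id: str
    display_name: str


@dataclass
class TaxonItem(VocabularyItem):
    """Zoological taxonomy item: adds a rank such as Family or Genus."""
    rank: Optional[str] = None


@dataclass
class LocationItem(VocabularyItem):
    """Location item: adds geographic coordinates."""
    latitude: Optional[float] = None
    longitude: Optional[float] = None


def label(item: VocabularyItem) -> str:
    """Consumers can treat every subclass as a plain VocabularyItem."""
    return f"{item.item_id}: {item.display_name}"


# Illustrative values only.
genus = TaxonItem(item_id="t1", display_name="Panthera", rank="Genus")
place = LocationItem(item_id="l1", display_name="Main Gallery",
                     latitude=37.87, longitude=-122.27)
```

Code written against the base VocabularyItem (such as label above) keeps working for the extended items, which is the property the service model relies on.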

Assumptions

  1. Vocabulary, as we define it, includes everything from simple enumerations to thesauri, taxonomies, and ontologies; where we say "vocabulary", any and all of these must be considered.
  2. Some vocabularies are simple lists or trees of terms, while others model concept hierarchies and graphs, in which each concept has multiple terms, synonymy, linguistic enhancement for text mining, reference lists to support the definition or inclusion of a concept, and other such annotations.
  3. Many other service entities must support vocabulary models and behaviors. E.g., Person, Organization, and Location all act like controlled vocabularies in the application, all have structural and other relations among the terms, and all may need to support import of existing, external definitions (Name Authorities, Org Charts, Placename gazetteers, etc.).

  1. There are many different types of relations that must be supported, to organize and relate concepts or items within a vocabulary. These include broader/narrower terms, synonyms, physical containment or part-of relations, and many more.
  2. We need to be able to support multiple, distinct hierarchies over the same set of concepts, and to distinguish among the different hierarchies (e.g., organizations may be related both in a structural or political hierarchy, and also in a financial or budgetary hierarchy).
  3. Some vocabularies are defined local to an institution, while others are defined and maintained by a community or external body.
    1. We need to be able to import standardized vocabularies, taxonomies, and even ontologies, and provide them for use in an application.
    2. We may have to export vocabularies for sharing with other institutions.
    3. Even for imported vocabularies, it is common for an institution to make additions and changes.
  4. Vocabularies change over time, and we must be able to represent these changes, optionally keeping a history of changes, new and deprecated terms, etc.
    1. We need to allow local users to modify an existing or imported vocabulary.
    2. We may have to manage the local changes to a centrally maintained vocabulary, including differences that can be applied as the reference vocabulary is updated by the associated responsible body.
  5. In addition to basic CRUD operations to manage vocabularies, we must support certain application functionality. In particular, autocomplete must provide a ranked list of terms that match a partial pattern. There may be various means to define the appropriate ranking (alphabetical, recently-used, most-used overall, etc.).
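
The autocomplete requirement can be illustrated with a small sketch. It assumes prefix matching and a "most-used overall, then alphabetical" ranking, which is only one of the rankings mentioned above; the function name and the usage counter are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def autocomplete(prefix: str, terms: Iterable[str], usage: Counter,
                 limit: int = 5) -> List[str]:
    """Return terms starting with prefix, most-used first, ties alphabetical."""
    needle = prefix.casefold()
    matches = [t for t in terms if t.casefold().startswith(needle)]
    # A Counter returns 0 for terms that have never been used.
    matches.sort(key=lambda t: (-usage[t], t.casefold()))
    return matches[:limit]


# Illustrative term list and usage counts.
terms = ["Bronze", "Brass", "Brick", "Bone", "Iron"]
usage = Counter({"Brick": 12, "Bronze": 12, "Brass": 3})
suggestions = autocomplete("br", terms, usage)
```

Other rankings (recently-used, per-user usage) would swap in a different sort key without changing the matching step.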

Key Concepts

  • Vocabulary - a collection of items and (often, but not necessarily) relations, and the general semantics of the vocabulary as a whole (whether static or dynamic, types of relations allowed, etc.).
  • VocabularyItem - a term in an enumeration/authority, a taxon, or a concept in an ontology. Is "owned" by a single vocabulary. Is related to other VocabularyItems through VocabularyItemRelations. It is referenced as an abstract URI constructed from the name of the vocabulary and an item id.
  • VocabularyItemRelation - an RDF-like triple associating two VocabularyItems in some manner. These triples are typed and directional (although some may be associative). They may also be associated with a named subset, allowing multiple overlapping graphs on the same VocabularyItems.
  • VocabularyReferenceResolver - a utility service that can take a structured VocabularyItem reference (a URI) and convert this to a REST path that invokes a service that supports the named vocabulary.

These concepts are detailed below.

Vocabulary

A Vocabulary instance is a namespace of (associated) controlled terms and relations. The Vocabulary defines which types of VocabularyItemRelations are allowed, which effectively defines the shape and features of the implied graph.

The Vocabulary has a display-name, a reference to the owning organization, a simple id value, and a base-uri (an abstract URI used to construct external references to the individual VocabularyItems). It must support read/search methods based on patterns (to support autocomplete UI tools). The Vocabulary must also support, or have access to, usage statistics that enable sorting search results by recency of usage or overall usage (across the collections, or by a given user).

Individual services implementing Vocabulary semantics may add additional properties as well.

VocabularyItem

A VocabularyItem is a single controlled term within a vocabulary. It may be the preferred term, or an alternate spelling or synonym of another term in the vocabulary, linked to that term by a VocabularyItemRelation. The term may have associated literature or citations (especially for life sciences taxa), again linked by a VocabularyItemRelation.

The VocabularyItem has a display-name, a reference to the owning vocabulary, a simple id value (used in the context of the owning Vocabulary), and a reference-uri (an abstract URI used in external references to the id, e.g., by other service entities).

Individual services implementing Vocabulary semantics may define additional VocabularyItem properties as well.
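
As a sketch of how the Vocabulary and VocabularyItem properties fit together, the following derives an item's reference-uri from the owning Vocabulary's base-uri and the item id. The "urn:example:" URI form, the colon separator, and the field names are assumptions for illustration; the actual URI scheme is not defined here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vocabulary:
    vocab_id: str
    display_name: str
    owner_ref: str   # reference to the owning organization
    base_uri: str    # abstract URI; the "urn:example:" form is hypothetical


@dataclass(frozen=True)
class VocabularyItem:
    item_id: str     # only meaningful within the owning Vocabulary
    display_name: str
    vocabulary: Vocabulary

    @property
    def reference_uri(self) -> str:
        """Abstract URI that other service entities use to refer to this item."""
        return f"{self.vocabulary.base_uri}:{self.item_id}"


# Illustrative values only.
entry_methods = Vocabulary("entrymethod", "Entry Method", "org-1",
                           "urn:example:vocab:entrymethod")
gift = VocabularyItem("gift", "Gift", entry_methods)
```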

VocabularyItemRelation

VocabularyItemRelations link VocabularyItems into trees and graphs, and link to synonyms, related terms, etc. A VocabularyItemRelation has a type and an optional subtype. It may also have a set reference (id), which is effectively a sub-namespace within the namespace of the Vocabulary, allowing for multiple graphs over the same concepts (i.e., VocabularyItems). These types define the semantics of the relation, including the allowed entities referenced (e.g., a(nother) VocabularyItem, an external URI, etc.), and properties like associativity, commutativity, symmetric relations, etc. Example types may include (but not be limited to):

  • Type:Parent, referent is another VocabularyItem in the current Vocabulary
    • SubType:Broader-term/Hypernym
    • SubType:partOf (a.k.a. Holonym, i.e., child is part of parent)
    • SubType:Political/Organizational
    • SubType:Group
    • etc.
  • Type:Child, referent is another VocabularyItem in the current Vocabulary
    • SubType:Narrower-term/Hyponym
    • SubType:hasPart (child is part of parent)
    • etc.
  • Type:Internal-ref, referent is another VocabularyItem in the current Vocabulary
    • SubType:Synonym/alternate term
    • SubType:Preferred term
    • SubType:Related (unspecified)
    • etc.
  • Type:ExternalURI-ref, referent is a URI/URL
    • SubType:Equivalent (in another vocabulary)
    • SubType:Derived from (some other vocab term)
    • SubType:Documentation
    • SubType:Standard reference URL, e.g., for a GenBank reference.
    • etc.
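
A minimal sketch of the relation triple, assuming relations are stored as flat records: the set reference lets two hierarchies coexist over the same items. All names here (subject, referent, set_id, the example ids) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class VocabularyItemRelation:
    subject: str                    # id of the item the relation starts from
    rel_type: str                   # e.g. "Parent", "Child", "Internal-ref"
    referent: str                   # item id, or a URI for ExternalURI-ref
    subtype: Optional[str] = None
    set_id: Optional[str] = None    # named subset: one graph among several


def parents(relations: List[VocabularyItemRelation], item: str,
            set_id: str) -> List[str]:
    """Parents of item within one named hierarchy."""
    return [r.referent for r in relations
            if r.subject == item and r.rel_type == "Parent"
            and r.set_id == set_id]


# The same organization has a different parent in each hierarchy.
relations = [
    VocabularyItemRelation("conservation-lab", "Parent", "collections-dept",
                           subtype="Political/Organizational",
                           set_id="reporting"),
    VocabularyItemRelation("conservation-lab", "Parent", "grants-office",
                           subtype="Political/Organizational",
                           set_id="budget"),
]
```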

Individual services implementing Vocabulary semantics may define additional VocabularyItemRelation types and subtypes. Each Vocabulary defines the set of allowed relations, and any allowed sets. For example:

  • Enumerations allow no relations - this is the simplest case. Some simple authorities may just be a list of names.
  • Thesauri allow synonymy and other related terms within the vocabulary, but some may not have any parent/child structures.
  • Simple graphs may support sparse parent relations. E.g., Person allows parent-child relations for Political/Organizational reporting structures, but many entries may not be linked in the graph.
  • Taxonomies allow parent/child relations for broader and narrower terms, and may support synonymy and other related terms within the vocabulary. Life Science taxonomies often have external links to supporting documentation, bibliographic references, etc. They also define additional properties on the items, like rank.
  • Ontologies like Location may allow multiple types of parent/child relations for political and geographical containment, for physical/built containment (e.g., a room in a building) or position-within (e.g., a shelving unit in a room). There may be alternate names for locations, including language variants. They will likely add additional properties on the items, like latitude and longitude, altitude or height, and perhaps even a polygon describing a region.
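
The rule that each Vocabulary defines its allowed relations could be enforced with a simple lookup, sketched below. The table of kinds and allowed types is a hypothetical configuration derived loosely from the examples above, not a definitive list.

```python
from typing import Dict, FrozenSet

# Hypothetical configuration: relation types allowed per kind of vocabulary.
ALLOWED_TYPES: Dict[str, FrozenSet[str]] = {
    "enumeration": frozenset(),                      # no relations at all
    "thesaurus": frozenset({"Internal-ref"}),        # synonyms, related terms
    "taxonomy": frozenset({"Parent", "Child", "Internal-ref",
                           "ExternalURI-ref"}),
}


def is_allowed(vocabulary_kind: str, rel_type: str) -> bool:
    """True if this kind of vocabulary permits relations of rel_type."""
    return rel_type in ALLOWED_TYPES.get(vocabulary_kind, frozenset())
```

A create or update call on a relation would check is_allowed first and reject relations the owning Vocabulary does not permit.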

VocabularyReferenceResolver

The VocabularyReferenceResolver is a central utility service that maintains a registry of namespaces and the services that manage/support them. Given an abstract URI for a VocabularyItem reference (e.g., the EntryMethod field of Intake refers to one of a simple enumeration of values with an abstract URI), it will return a REST URL for the service that supports this. E.g.,

  • Need example of mapping simple enumeration term to a Vocabulary service URL
  • Need example of mapping getty ULAN term to a Person service URL
  • Need example of mapping storage location term to a Location URL
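
Until the real examples above are filled in, the resolver's behavior can be sketched with made-up values. Everything concrete below is hypothetical: the "urn:example:" URI form, the assumption that the item id is the last colon-separated segment, and the example.org REST paths.

```python
from typing import Dict


class VocabularyReferenceResolver:
    """Maps abstract URI namespaces to the REST base URL of the owning service."""

    def __init__(self) -> None:
        self._registry: Dict[str, str] = {}

    def register(self, namespace: str, service_url: str) -> None:
        self._registry[namespace] = service_url.rstrip("/")

    def resolve(self, reference_uri: str) -> str:
        # Assumption: the item id is the final colon-separated segment.
        namespace, sep, item_id = reference_uri.rpartition(":")
        if not sep or namespace not in self._registry:
            raise KeyError(f"no service registered for {reference_uri!r}")
        return f"{self._registry[namespace]}/{item_id}"


resolver = VocabularyReferenceResolver()
resolver.register("urn:example:vocab:entrymethod",
                  "https://example.org/services/vocabularies/entrymethod/items/")
url = resolver.resolve("urn:example:vocab:entrymethod:gift")
```

Because the registry is keyed by namespace, a Person or Location namespace would simply be registered against a different service URL.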

Additional Notes:

Terms or concepts in a vocabulary are used in several ways:

  1. As field values in some entities, where the type of the field references the vocabulary. This case is commonly (exclusively?) used with enumerations.
  2. As an associated annotation to some entity, where there may be multiple associations. These associations can have considerable additional metadata:
    1. Who authored the association. This may be a Person or it may be an algorithm in the case of automated indexing.
    2. Begin and End dates of the association (it may be rescinded or superseded at some point).
    3. Notes from the author/creator/indexer.
    4. Supporting documentary material for the association.
    5. A confidence value for the association.
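
The association metadata listed above might be carried in a record like the following sketch. The field names and the is_active helper are assumptions for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class TermAssociation:
    entity_ref: str
    term_uri: str
    author: str                      # a Person reference or an algorithm name
    begin: date
    end: Optional[date] = None       # None while the association still holds
    notes: str = ""
    supporting_material: Optional[str] = None
    confidence: Optional[float] = None

    def is_active(self, on: date) -> bool:
        """True if the association holds on the given date."""
        return self.begin <= on and (self.end is None or on <= self.end)


# Illustrative values only.
tag = TermAssociation("object-42", "urn:example:vocab:materials:bronze",
                      author="auto-indexer", begin=date(2009, 1, 1),
                      end=date(2009, 6, 30), confidence=0.8)
```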

We will at some point need to consider the management of local edits to a shared vocabulary. This may require a merging facility if the shared resource is regularly maintained (e.g., by a community of contributors, as for life science taxonomy). We would need to handle subscription-based updates from the common resource, and then (re-)apply local edits and changes. Naturally, this can get non-trivial. We should review SVN-like tools that manage branches from a common trunk. This is a future feature - see also the scope section below.
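
To make the "update, then re-apply local edits" idea concrete, here is a deliberately naive sketch. It assumes a vocabulary is a mapping from item id to label and that local edits are recorded as (operation, item id, label) tuples; both the edit format and the conflict rule are hypothetical, and a real merge facility would need to handle relations and metadata as well.

```python
from typing import Dict, List, Tuple

Edit = Tuple[str, str, str]  # (operation, item id, label); hypothetical format


def reapply_local_edits(upstream: Dict[str, str],
                        edits: List[Edit]) -> Tuple[Dict[str, str], List[Edit]]:
    """Apply local edits to a fresh upstream copy; return (merged, conflicts)."""
    merged = dict(upstream)
    conflicts: List[Edit] = []
    for edit in edits:
        op, item_id, label = edit
        if op == "add" and item_id not in merged:
            merged[item_id] = label
        elif op == "relabel" and item_id in merged:
            merged[item_id] = label
        else:
            conflicts.append(edit)   # cannot be applied; needs manual review
    return merged, conflicts


# Illustrative data: the third edit targets an item that upstream has dropped.
new_upstream = {"t1": "Felidae", "t2": "Panthera"}
local_edits = [
    ("add", "local1", "Local working name"),
    ("relabel", "t2", "Panthera (big cats)"),
    ("relabel", "t9", "Removed upstream"),
]
merged, conflicts = reapply_local_edits(new_upstream, local_edits)
```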

Vocabulary Maintenance

There are a host of use-cases around the maintenance of vocabularies, ranging from the update of a standard, shared termlist from some central authority, to the addition of local terms. Spectrum describes a basic set of activity data to be maintained as part of these activities including:

  • The change made (e.g., concept/taxon/VocabularyItem addition, deletion/deprecation, update to relations/position, update of metadata (links to publications, external documentation), Vocabulary subscription update, Vocabulary import, etc.).
  • Who authorized the change (Person)
  • Who performed or recorded the change (Person)
  • When the change was performed or recorded (Date)
  • Source or Author of the new/changed information (Person or Organization)
  • Date of the new/changed information at the source (Date)

Spectrum further mentions "Recording progress", which is a small controlled vocabulary (enumeration) of states in a workflow, e.g., "in progress", "to be approved", etc. We will model this somewhat more explicitly as a workflow state on each term, including states like the following (this is still evolving):

  • Proposed. Note that the author is the one who proposes/proposed the change. Not sure of label for this, but we may need to distinguish a new item that does not even appear in the vocabulary, from one that has been tentatively added and is pending review.
  • Tentative. E.g., when someone needs a new term and adds it in the UI, but it is pending review by the vocabulary managers.
  • Reviewed. Reviewed will likely need several variants, including Reviewed-approved, Reviewed-rejected, Reviewed-approved-with-modifications. This will need additional metadata on the reason for the review finding (which might be a controlled termlist, and text notes). We must model the original proposer as well as the reviewer, and also handle the case in which there is a chain of reviewers.
  • Accepted. This is the normal case of a term in use.
  • Rejected. This may be needed after a term has been rejected but until any references to the tentative term have been updated.
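
The workflow states could be modeled as an enumeration with a table of permitted transitions, as in the sketch below. The state names come from the list above, but the transitions are assumed purely for illustration; the actual workflow is still evolving and may differ.

```python
from enum import Enum


class TermState(Enum):
    PROPOSED = "Proposed"
    TENTATIVE = "Tentative"
    REVIEWED = "Reviewed"
    ACCEPTED = "Accepted"
    REJECTED = "Rejected"


# Assumed transitions, for illustration only; the real workflow is undecided.
TRANSITIONS = {
    TermState.PROPOSED: {TermState.TENTATIVE, TermState.REVIEWED},
    TermState.TENTATIVE: {TermState.REVIEWED},
    TermState.REVIEWED: {TermState.ACCEPTED, TermState.REJECTED},
    TermState.ACCEPTED: set(),
    TermState.REJECTED: set(),
}


def can_move(current: TermState, target: TermState) -> bool:
    """True if a term may move from current to target in this sketch."""
    return target in TRANSITIONS[current]
```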

Dependencies

  • TBD

Background Documentation