PAHMA data cleaning—Object names

The object names in PAHMA's CMS were not selected from a controlled vocabulary. The Museum's pre-digital primary catalog (its card catalog) did not use object names; rather, it made use of very brief descriptions. Current object names come from diverse, idiosyncratic sources:

  • When the Museum started entering data in an early digital collection management system, the supervisor in charge of data entry decided to parse that brief description into two mutually exclusive fields: object name and description. In this way, an original description such as "Arrow shaft, unfinished" gave rise to the object name "Arrow shaft", and a description of "unfinished".
  • When an inventory was conducted and objects were encountered that had no object name entered, the person doing the inventory (who may not have known anything about the object in question) assigned an inventory object name.
  • For objects cataloged after the implementation of a digital collections management system (CMS), the Registrar assigned appropriate and meaningful object names.
  • As collections managers come across objects with little information in the CMS, they would often enter appropriate and meaningful object names.
  • For large, fairly homogenous collections that still had few or no object names entered, generic names were assigned (such as "photographic negative" for all unnamed photographic negatives).

There is rich information contained in object names currently in our CMS, but for a number of reasons, these object names need to be controlled. We have searched for an appropriate object name vocabulary, but have not found on that is well-suited to our large and wide-ranging collections. Until an appropriate object name vocabulary is found or developed, we will attempt to build our own vocabulary, working bottom up from existing object names.

Susan created a web page that presents usage statistics on TMS.Objects.ObjectNames (click here to view). The query she used was:

objnames.jsp
SELECT    LTRIM(ObjectName), COUNT(LTRIM(ObjectName))
FROM      Objects
WHERE     ObjectName IS NOT NULL AND RTRIM(ObjectName) !=''
GROUP BY  LTRIM(ObjectName)
ORDER BY  LTRIM(ObjectName)

At the beginning of this exercise (January 13, 2011) there were 34,187 distinct object names among approximately 640,000 records (the initial query did not exclude virtual and template records).

Excluding virtual and template records, there were 33,148 distinct, non-null object names used for 536,372 objects.

As of January 31, 2011, there were  27,999 distinct, non-null object names used for 535,967 objects (named object count was reduced as meaningless names such as "fragment" or "object" were eliminated).

The strategy we decided to follow was this:

  1. Create a new field (MTB_ObjectName) in which the revised object names can be staged and manipulated (complete)
  2. Create a new field (MTB_ObjectNameState) to hold certain specific information contained in the current object names (details below) (complete)
  3. Populate MTB_ObjectName with the contents of ObjectName WHERE IsVirtual=0 AND IsTemplate=0 (complete for 631,930 rows)
  4. Apply naming rules to the data in MTB_ObjectName:
    • Remove all capitalization (except from proper nouns) — not sure how to avoid the proper nouns (complete for 535,969 rows, incl. proper nouns)
      UPDATE    Objects
      SET       MTB_ObjectName = LOWER(MTB_ObjectName)
      WHERE     MTB_ObjectName IS NOT NULL
      
    • Remove leading and/or trailing spaces and newlines (complete)
      UPDATE    Objects
      SET       MTB_ObjectName = RTRIM(LTRIM(MTB_ObjectName))
      WHERE     MTB_ObjectName IS NOT NULL
      
    • Remove the following from object name (unless part of a compound noun):
      • Number/quantity
      • Color
      • Culture
      • Material
      • Dimensions
    • Move modifiers (e.g., "?", "frag."/"fragment", "incomplete", "broken") to MTB_ObjectNameState (which can be concatenated with MTB_ObjectName to make the displayed object name)
      UPDATE    Objects
      SET       MTB_ObjectNameState = 'fragments'
      WHERE     MTB_ObjectName LIKE '%frags.' and MTB_ObjectNameState IS NULL
      
      UPDATE    Objects
      SET       MTB_ObjectName = REPLACE(MTB_ObjectName, ' frags.', '')
      WHERE     MTB_ObjectName LIKE '%frags.'
      
    • Limit object name to 30 characters or less (unless a translation is involved; see next bullet point)
      SELECT    ObjectName, LEN(objectname) AS 'LENGTH', COUNT(ObjectName) AS 'COUNT'
      FROM      Objects
      WHERE     LEN(ObjectName)>30
      GROUP BY  ObjectName
      ORDER BY  LEN(ObjectName) DESC
      
    • For non-English object names (e.g., "huipil"), provide a parenthetical English translation after the preferred name (e.g., "huipil (blouse)"). The 30 character limit applies to a single object name, so a name with a translation would be allowed up to 30 + 30 = 60 characters
      • Question: could such non-English object names be italicized?
      • Answer: no, not without great difficulty.
  5. Once this is done and the Registrars are happy with the results:
    • ObjectName will be pre-pended to Description (in which cases?)
    • ObjectName will be replaced with contents of MTB_ObjectName
    • MTB_ObjectNameState will remain as-is

The following strings were removed from MTB_ObjectName:

  • very
  • small
  • fragment[.][,]
  • fragments[.][,]
  • fragment[s] of [a]
  • [;][,][:] frags
  • yellow, purple, pink, blue, orange, brown
  • 1, 2, 3, 4, 5, 6, 7, 8, 9 (beginning of string only, followed by space)
  • two, three, four, five

The following strings still need to be removed from MTB_ObjectName:

  • Ragged fragment of
  • Wooden fragment of
  • Torn fragment of
  • Two fragments of
  • Three fragments of
  • Tray of fragments of
  • Bundle of fragments of
  • Partially repaired fragments of
  • broken, 
  • brk.
  • Object:
  • Gray
  • Grey
  • Red