Notes on Data Cleanup, M. Ahern

Key: Red is for areas in which I still need to run preliminary reports. Purple is for bits of data that will need to be moved by hand. The rest of the report describes data migration that can probably be automated. Whether or not it is worth automating this data migration will depend on how labor-intensive it is to migrate automatically versus by hand. 

...

  • mfp (Main Photo File): 164 object records.
  • OH (oral histories): 5 object records.
  • FP (film program photographs): 943 object records.

...

  • For loans: Loan numbers for objects should be formatted like: "IL0001.0001." However, about 320 records are formatted like "IL-0004.0001." The hyphen should be removed.
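The hyphen removal described above could be scripted roughly as follows. This is only a sketch: it assumes the loan number is exactly "IL", an optional hyphen, four digits, a period, and four digits (the trailing period in the examples above is taken to be sentence punctuation), and the function name normalize_loan_number is hypothetical.

```python
import re

# Assumed shape of a mis-formatted loan number: IL-0004.0001
_HYPHENATED_LOAN = re.compile(r"^IL-(\d{4}\.\d{4})$")

def normalize_loan_number(value: str) -> str:
    """Drop the hyphen after "IL"; return any other value unchanged."""
    return _HYPHENATED_LOAN.sub(r"IL\1", value.strip())
```

Values that do not match the assumed pattern pass through untouched, so they can still be reviewed by hand.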

...

Data Cleanup that can be automated. See also Objects data cleanup completed by hand and notes on OC objects data cleanup.

  • ARTIFACT CLASS, WORK TYPE, COLLECTION CATEGORY: Many uncategorized objects can be categorized using the attributes "Category" and "Subcategory." See Classification Walkover (simple), and a more detailed, object-by-object walkover which shows how many objects are in each artifact class/work type. Many of the objects on this list can be classified by hand (where there are too few objects in a category to justify automating their classification).
  • Many objects will have to be classified by hand by a cataloger before the migration, or else left undetermined. There is one list of assorted objects to classify and one list of books to classify. See full list of objects to classify and list of books to classify.
  • See the Classification Walkover page for instructions.
  • DIMENSIONS: In OC, "Dimensions" is a single field containing up to three values. In CS, there will be multiple fields for dimensions that will be named according to type (e.g., height, width, depth, diameter, etc). We will need to separate each of the three values into its own field.
    • Assigning measurement types:
      • In most cases, each value will go in its own field WITHOUT a measurement type. This is because, while most OC records list the values in a specific order (height, width, then depth), many records do not follow this rule.
      • Fields that contain all three characters (H, W, and D): We can assign a measurement type to the value based on the character.
        • In most cases, there is a space between the value and the letter, eg:
          • 10 H x 11 W x 2 D, or
          • 10" H x 11" W x 2" D
      • In catalog records created after June 2011, there is no space after the value:
        • 10h x 11w x 2d
      • Fields that contain the word "Diameter" or "Dia": We can assign a measurement type to the value based on that word. (Given that the letter "D" appears both as the abbreviation for "Depth," as above, and in "Diameter," will it be possible to automate measurement type?)
      • CS Field "Dimension Measured By": Data from the attribute field "Other Physical Details" should be moved here with the text incorporated into a note. It's unclear whether or not there's a way to automate this.
      • There is a CS Field for each dimension called "Dimension Measured By." Is there a way to pull this information from "Object Histories"?
    • Formatting measurement values:
      • In all cases we can convert fractions (1/2) into decimals (.5). If we do so, we will have to delete the space before the decimal, unless the fraction is the first value in the list of 2-3 lengths listed in the "dimensions" field.
      • Many decimal values are displayed in increments smaller than .25 (e.g. .125, .375), and many values include fractions in increments smaller than 1/4 (e.g., 3/8 or 5/8). Should we round these up or down?
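As a rough illustration of the dimension rules above, the sketch below splits an OC "Dimensions" string on " x ", converts fractions and mixed numbers to decimals, and assigns a measurement type only when H, W, and D are all present. The function names and the returned structure are hypothetical, the input shapes are assumed from the examples in these notes, and no rounding is applied because that question is still open. Anything that does not fit (for example, "Diameter" values) returns None so it can be handled by hand.

```python
import re
from fractions import Fraction

# One measurement: a whole number, decimal, fraction, or mixed number,
# an optional inch mark, and an optional H/W/D letter.
_PART = re.compile(
    r'^\s*(?P<num>\d+\s+\d+/\d+|\d+/\d+|\d*\.\d+|\d+)\s*"?\s*(?P<type>[HhWwDd])?\s*$'
)
_TYPES = {"h": "height", "w": "width", "d": "depth"}

def to_decimal(num: str) -> float:
    """Convert '10', '.5', '1/2' or '10 1/2' to a decimal value."""
    return float(sum(Fraction(piece) for piece in num.split()))

def parse_dimensions(text: str):
    """Return [(value, measurement_type_or_None), ...], or None if unparseable."""
    parsed = []
    for part in re.split(r"\s+[xX]\s+", text.strip()):
        m = _PART.match(part)
        if m is None:
            return None  # leave this record for hand review
        letter = m.group("type")
        mtype = _TYPES[letter.lower()] if letter else None
        parsed.append((to_decimal(m.group("num")), mtype))
    # Trust the letters only when all three of H, W and D are present.
    if {mtype for _, mtype in parsed} != {"height", "width", "depth"}:
        parsed = [(value, None) for value, _ in parsed]
    return parsed
```

For example, parse_dimensions("10h x 11w x 2d") yields typed values, while parse_dimensions("10 1/2 x 3/8 x 2") yields three untyped decimals.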
    • Extent - Many objects have no extent. (Extent report).
      • Some of these objects have lot numbers but not individual accession numbers. Thus, the lot will contain multiple objects with the same 7-digit accession number.
        • According to Megan, many of these lots contain extent info for the entire lot, listed in the access database order, in their "Administrative remarks" fields. However, this is not a consistent pattern. See extent report on lots or groups of objects that have inventory extent info from the access database in the admin notes field here. We need to go through this data by hand and try to assign it to its proper place. Finish cleaning up this data.
      • Some of these objects have no accession number (that is, their accession number is listed as "unknown"). Move data for "unknown" object extents over from the old access database.
      • For many of these objects, it is obvious from the title that the extent is "1." However, these records number in the thousands.
    • Registrar Status -
      • This is the correct formatting for objects smaller than (x); for larger objects, these values should be rounded.
  • REGISTRAR STATUS: Delete this field entirely.
  • PHYS_REMARKS: This is a hidden data field on OC -- it appears in the database and in reporting, but cannot be edited through the OC user interface. Over 5000 records contain information in this field, so it would be difficult to move this information by hand. See report.
    • All information in this field could be moved to ATTRIBUTES, except for "Markings." The "Markings" information could be moved to the field that is currently, in OC, called "Content Remarks."
    • In reporting, all Phys_remarks info for a record appears in the same cell. The attribute value appears after the attribute type. This is the complete list of attribute types:
      • Format Gauge:
      • ModelNumber:
      • Material:
      • Markings:
      • Serial Number:
      • Weight:
    • If one of these attribute fields contains multiple values, these are recorded in the report as separated by semicolons. The fields themselves are separated from other data in the field by a single space. Example:
      • Material: Tin; porcelain; wood ModelNumber: 1912
    • ONLY within the attribute field "materials," there are additional cases in which values are often separated by commas. Example:
      • Material: wood, metal, glass
    • "Markings" often contains a comma as a part of the value, rather than as a symbol that denotes two separate values. Example:
      • Markings: 'Nite Lite' Exclusively Distributed by LECO Electric Manufacturing Co., Florida, NY.  Copyright by Kagran Corporation.  Made in Japan.
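The Phys_remarks layout described above (attribute type, colon, value, with fields separated by a space) could be split apart along these lines. This is a sketch under assumptions: the label spellings are copied from the list above, the function name parse_phys_remarks is hypothetical, and "Markings" is kept as one unsplit value because its commas and periods are usually part of the text. A marking that itself contains one of the labels followed by a colon would be split incorrectly and would need hand review.

```python
import re

ATTRIBUTE_TYPES = [
    "Format Gauge", "ModelNumber", "Material",
    "Markings", "Serial Number", "Weight",
]
# A label counts only at the start of the cell or right after whitespace.
_LABEL = re.compile(
    r"(?:^|(?<=\s))(" + "|".join(re.escape(a) for a in ATTRIBUTE_TYPES) + r"):"
)

def parse_phys_remarks(cell: str) -> dict:
    """Map each attribute type found in the cell to its list of values."""
    result = {}
    matches = list(_LABEL.finditer(cell))
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cell)
        raw = cell[m.end():end].strip()
        name = m.group(1)
        if name == "Markings":
            values = [raw]  # keep whole: punctuation is part of the marking
        elif name == "Material":
            values = re.split(r"\s*[;,]\s*", raw)  # semicolons or commas
        else:
            values = re.split(r"\s*;\s*", raw)  # semicolons only
        result.setdefault(name, []).extend(v for v in values if v)
    return result
```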
  • CONDITION and ARTIFACT NEEDS: "Artifact Needs" information can be fed into "Condition" where it matches up with one of the four values in "Condition."
    • Adding value to the "Condition" field:
      • Sometimes data that should be in "Condition" appears in the "Artifact Needs" field. See artifact needs report.
      • Where the exact phrase "Exhibitable/Needs Work" appears in "Artifact Needs," delete the phrase and change the value in "Condition" to 1.
      • Where the exact phrase "Needs No Work" appears, delete the phrase and change the value in "Condition" to 0.
      • Where the exact phrase "In Jeopardy/Unstable" appears, delete the phrase and change the value in "Condition" to 3.
      • Where the exact phrase "Not Exhibitable/Stable" appears, delete the phrase and change the value in "Condition" to 2.
      • These phrases occasionally appear in the field "Artifact Needs" with other text. The other text should not be deleted. If it is impossible to delete some text and retain other text in the field, then no text should be deleted.
    • Adding values for "Marking" and "Tagging"
      • Where the exact phrases "Marking" and "Needs marking" appear, check "needs marking" in Artifact Needs.
      • Where the exact phrases "Tagging" and "Needs tagging" appear, check "needs tagging" in Artifact Needs.
      • Where the exact phrases "Marking and tagging," "Marking & tagging," "Needs marking and tagging," and "Needs marking & tagging" appear, check both "marking" and "tagging" in Artifact Needs.
      • There are no other phrasings aside from those listed above, so this task can be automated.
    • Keep remaining values in "Artifact Needs" and keep the field.
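The "Condition" and "Artifact Needs" rules above could be automated roughly as sketched here. The phrase-to-value mapping is taken from the list above. The function name, the return shape, the case-insensitive match for marking/tagging, and the decision to leave a record untouched when more than one condition phrase appears are all assumptions made for illustration.

```python
import re

# Exact phrases from the notes and the "Condition" value each one maps to.
CONDITION_PHRASES = {
    "Needs No Work": 0,
    "Exhibitable/Needs Work": 1,
    "Not Exhibitable/Stable": 2,
    "In Jeopardy/Unstable": 3,
}

def migrate_artifact_needs(text: str):
    """Return (condition_or_None, needs_marking, needs_tagging, remaining_text)."""
    found = [p for p in CONDITION_PHRASES if p in text]
    condition, remaining = None, text
    if len(found) == 1:  # ambiguous records are left as-is for hand review
        condition = CONDITION_PHRASES[found[0]]
        remaining = text.replace(found[0], "")
        # Tidy the separators left behind; all other text is retained.
        remaining = re.sub(r"\s{2,}", " ", remaining)
        remaining = remaining.lstrip(" ;,.").rstrip(" ;,")
    needs_marking = re.search(r"\bmarking\b", text, re.IGNORECASE) is not None
    needs_tagging = re.search(r"\btagging\b", text, re.IGNORECASE) is not None
    return condition, needs_marking, needs_tagging, remaining
```

The marking and tagging phrases are only detected, not deleted, since the notes say to keep the remaining values in "Artifact Needs."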
  • ACCESSION DATE: Delete this field entirely.
  • ATTRIBUTES: (Data below comes from looking at each field on this report of all objects with all attribute fields.)
    • The following attribute fields are available for inclusion in reports run on OBJECTS: Category, Components, Copyright date, Copyright holder, Copyright Statement, Creation date, Credit line, Dimensions (in), Display Date, Extent, Format, Manufacturer, Material, Other physical details, Photo credit, Publication date, Publisher, Serial number, Subcategory, Subject, Technique.
    • The following attribute fields are a part of the database structure but are NOT available in reports run on OBJECTS: Color, Color/BW, Country, Creator, Date of birth, Date of death, Director, Distributor, Form, Genre, Historical notes, Key, Label and caption, Language, Licensor, Medium, Network/Cable service, Place name, Principle cast, Production company, Production date, Release date, Running dates, Running time. Most of these attributes are in use in other data sets (Entities, Occurrences, etc). It is unknown whether or not any object records have data in any of these fields, but it is unlikely (the database structure probably does not allow it).
    • All attribute fields are repeatable. When exporting this data via the OC "reporting" function, however, each object record has only one cell per attribute field. In cases in which a field has been repeated in an object record, both pieces of data will appear in a single cell separated by a semicolon.
    • CATEGORY: Delete this attribute field after we finish the classification walkover.
    • COMPONENTS: Note: If we are using Excel to store data at any point in migrating this field: Be aware that Excel's auto-formatting will convert all page number values up to 12 p. (1 p. - 12 p.) automatically into times (1:00 PM - 12:00 PM). Simply reformatting the cells after exporting the data will not restore the data to its original form (instead, 12:00 PM becomes "0.5," and so on).
    • CREDIT LINE: Delete this attribute field (contains no data).
    • DIMENSIONS: Move this data over to the "Dimensions" field in "Cataloging" on CS. Since "Dimensions" is repeatable on CS, we do not need to replace the data that is already in the field. (Note: Some of the data contained in this field repeats data in the "Dimensions" field in the Objects > Basic tab. Some of it does not, and instead provides different measurements for an additional component of the object, e.g., the packaging for a toy, where the "Dimensions" field in Basic contains the dimensions of the toy itself.)
    • All Dates:  This includes DISPLAY DATE, COPYRIGHT DATE, CREATION DATE, and PUBLICATION DATE: Many object records contain multiple dates.
    • EXTENT: It appears that in MOST cases the cataloger mistakenly put the data for "Components" into this field instead. None of the objects that have data in "Extent" have data for "Components," so we could move the data into the "Components" field without worrying about writing over more recent/relevant data. There are relatively few fields with data for "Extent" (68 total), so we could easily move this data by hand. Still must do all the extents from 2002.007.
    • EXTENT: Delete this attribute field (contains no data).
    • OTHER PHYSICAL DETAILS: Most of this data belongs in other fields (such as dimensions, components, copyright statement, etc), but there is no way to sort it automatically. Move this data by hand. Some data remains that could be moved automatically:
      • Alternate dimensions (eg: "Bowl: Diameter 5.5 x 2; Cup: 4.25 x 2.25 x 2.25," or "Box: 1 x 7.25 x 7.25")
      • Clothing sizes (eg: "Adult size medium (38-40)," "Children's large.")
      • See "Other physical details" objects list for the remaining items. These may need to be moved by hand.
    • PUBLISHER: Delete this attribute field (contains no data).
    • SUBCATEGORY: Delete this field after completing the classification walkover.
    • SUBJECT: In all but a few cases this field is blank or contains irrelevant info. A few lots contain data we would like to preserve. I've listed these lots below along with information on the nature of the data and how we should migrate it. See subject report.
      • 1989.26: Lists associated individuals, productions, corporations, collections, and subjects in "Subject." Each of these strings of names is preceded by one of the following four phrases: "By Corporation Name:", "Miscellaneous:", "By Personality/Character Name:", "By Production Title:". However, these typologies only refer to the first item in each list and the list may contain other names, production names, and subject keywords. There are over 1000 objects, so it would be possible (though difficult) to move this data by hand. Here are a few examples: 
        • "By Personality/Character Name:  Lombard, Carole ; Carole Lombard; Sports/Exercise/Martial Arts; Home/Family Life"
        • "By Production Title:  STAR MAKER, THE; Linda Ware; Louise Campbell; Bing Crosby; Ned Sparks; Walter Damrosch (Conductor/ Actor); Gus Edwards (Songwriter); Teenagers"
        • "By Personality/Character Name:  Bennett, Constance; Constance Bennett; Joan Bennett; Barbara Bennett; Children/Babies"
      • 1986.59: "Subject" field gives more specific info than the "Subcategory" field. This info is not repeated elsewhere and would map better onto Work Types. The values listed in this lot under "Subject" include: Behind the Scenes Photo, Behind-the-Scene Photograph, Behind-the-Scenes, Behind-the-Scenes Still, Candid Portrait, Cast and Crew Photo, Production Photo, production photograph, Publicity Color Transparency, Publicity Portrait, Publicity Still, Scene Still, Scene Stills. The values listed under "Subcategory" include "Industry and Production Photographs" and "Scene Stills." There are 270 objects, so this data is easy to move by hand.
      • 1992.6: "Subject" field contains entity information. 95% of this information appears to have been entered by hand  into "Entities" for each record. I think it is safe to delete everything in "Subject" for this -- we may lose one or two terms, but that would be it.
      • 1994.2: Same as 1992.6.
      • 1995.7: Can delete all.
      • 1996.3: There is some data in "Subject" for many items in this lot but it is not consistent in nature. The lot is small and someone should go through it by hand in order to move the information to the proper fields.
      • 1998.5: Entity info in "Subject": this is a small lot and it's easier to move this information by hand.
      • I have already moved any relevant data into other fields in OC, save 1989.26 (in progress). After 1989.26 is completed, delete this attribute field and all the data in it.
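If any of the 1989.26 "Subject" strings were to be pre-sorted before hand review, a small helper along these lines could separate the leading phrase, the first item (the only one the phrase reliably describes), and the remaining items. The function name split_subject and the return shape are hypothetical; the four prefixes are the ones listed above for lot 1989.26.

```python
# The four leading phrases reported for lot 1989.26.
PREFIXES = (
    "By Corporation Name:",
    "Miscellaneous:",
    "By Personality/Character Name:",
    "By Production Title:",
)

def split_subject(value: str):
    """Return (typology, first_item, other_items) for one "Subject" string."""
    text = value.strip()
    for prefix in PREFIXES:
        if text.startswith(prefix):
            items = [s.strip() for s in text[len(prefix):].split(";") if s.strip()]
            first = items[0] if items else None
            return prefix.rstrip(":"), first, items[1:]
    # No known prefix: return the raw items untyped for hand review.
    return None, None, [s.strip() for s in text.split(";") if s.strip()]
```

The remaining items still mix names, productions, and keywords, so they would need a cataloger's judgment either way.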
    • TECHNIQUE: Replace all instances of "lithograph" with "lithography."
  • History.
    • LOANS: In order to NOT migrate the "Loans" history data, we'll have to delete only the "history" records with a type_id of 1 (Loans).
    • LOCATION: Standardize the format for "location."
      • Mis-formatting 1: There are 1,827 entries that have no colon between the storage area and the location code, and instead have a space.
        • Example: MST 7:5:2
        • Delete the space between MST and the number string and add in its place a colon.
      • Mis-formatting 2: There are nearly 200 entries that have a space between the first colon and the beginning of the numeric location code.
        • Example: MST: 7:5:2
        • Delete the space between MST: and the number string.
      • Notes: In many cases, there are notes listed in parentheses after the location code.
        • BOU:2:2:8 (lamphouse only)
        • MST:6:3:6 (.1-.3)
        • MST:12:5:6 (temporary)
        • In CS, all of these notes should be taken out of their parentheses and added as a note or specification to the location code.
      • Fix other formatting errors by hand. See history formatting spreadsheet.
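The location fixes above could be combined into a single pass, sketched below. It assumes a storage-area code made of capital letters (MST and BOU are the only ones shown here) followed by colon-separated numbers and an optional parenthetical note. The function name normalize_location is hypothetical. Anything that does not match returns None and would be fixed by hand.

```python
import re

# Area code, then a colon (optionally followed by spaces) or just spaces,
# then the numeric location code, then an optional note in parentheses.
_LOCATION = re.compile(
    r"^(?P<area>[A-Z]+)(?::\s*|\s+)(?P<code>\d+(?::\d+)*)\s*(?:\((?P<note>.*)\))?\s*$"
)

def normalize_location(value: str):
    """Return (standardized_location, note_or_None), or None if unrecognized."""
    m = _LOCATION.match(value.strip())
    if m is None:
        return None
    return f"{m.group('area')}:{m.group('code')}", m.group("note")
```

Both mis-formattings collapse to the same output, so "MST 7:5:2" and "MST: 7:5:2" each become "MST:7:5:2", and the parenthetical text comes back separately for the CS note field.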