Notes on Data Cleanup, M. Ahern
Key: Red is for areas in which I still need to run preliminary reports. Purple is for bits of data that will need to be moved by hand. The rest of the report describes data migration that can probably be automated. Whether or not it is worth automating this data migration will depend on how labor-intensive it is to migrate automatically versus by hand.
...
- WA (digital media): 6 object records.
- mfp (Main Photo File): 164 object records.
- OH (oral histories): 5 object records.
...
- For loans: Loan numbers for objects should be formatted like: "IL0001.0001." However, about 320 records are formatted like "IL-0004.0001." The hyphen should be removed.
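The hyphen removal described above could be sketched as follows. This is a minimal illustration, not a tested migration script: the function name `normalize_loan_number` is hypothetical, and the pattern assumes every loan number is "IL" followed by two digit groups separated by a period.

```python
import re

# Loan numbers with a stray hyphen after the "IL" prefix, e.g. "IL-0004.0001".
# Assumption: the rest of the number is digits, a period, then digits.
HYPHENATED_LOAN = re.compile(r"^IL-(\d+\.\d+)$")

def normalize_loan_number(value):
    """Trim whitespace and drop the hyphen after "IL"; other values pass through."""
    return HYPHENATED_LOAN.sub(r"IL\1", value.strip())
```

Values that do not match the pattern are returned unchanged (apart from trimmed whitespace), so anything unexpected can still be reviewed by hand.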
...
Data Cleanup that can be automated. See also Objects data cleanup completed by hand and notes on OC objects data cleanup.
- ARTIFACT CLASS, WORK TYPE, COLLECTION CATEGORY: Many uncategorized objects can be categorized using the attributes "Category" and "Subcategory." See Classification Walkover (simple), and a more detailed, object-by-object walkover which shows how many objects are in each artifact class/work type. Many of the objects on this list can be classified by hand (where there are too few objects in a category to justify automating their classification).
- Many objects will have to be classified by hand by a cataloger before the migration, or else left undetermined. There is one list of assorted objects to classify and one list of books to classify. See full list of objects to classify and list of books to classify. See the Classification Walkover page for instructions.
- DIMENSIONS: In OC, "Dimensions" is a single field containing up to three values. In CS, there will be multiple fields for dimensions, named according to type (e.g., height, width, depth, diameter). We will need to separate each of the three values into its own field.
- Assigning measurement types:
- In most cases, each value will go in its own field WITHOUT a measurement type. This is because, while most OC records list the values in a specific order (height, width, then depth), many records do not follow this rule.
- Fields that contain all three characters (H, W, and D): We can assign a measurement type to the value based on the character.
- In most cases, there is a space between the value and the letter, eg:
- 10 H x 11 W x 2 D, or
- 10" H x 11" W x 2" D
- In catalog records created after June 2011, there is no space after the value:
- 10h x 11w x 2d
- Fields that contain the word "Diameter" or "Dia": We can assign a measurement type to the value based on that word. (Given that the letter "D" appears both as "D" for "Depth," as above, and in "Diameter," will it be possible to automate measurement type?)
- CS Field: Dimension
- Measured By: Data from the attribute field "Other Physical Details" should be moved here with the text incorporated into a note. It's unclear whether there's a way to automate this.
- There is a CS Field for each dimension called "Dimension Measured By." Is there a way to pull this information from "Object Histories"?
- Formatting measurement values:
- In all cases we can convert fractions (1/2) into decimals (.5). If we do so, we will have to delete the space before the decimal, unless the fraction is the first value in the list of 2-3 lengths listed in the "Dimensions" field.
- Many decimal values are displayed in increments smaller than .25 (e.g., .125, .375), and many values include fractions in increments smaller than 1/4 (e.g., 3/8 or 5/8). Should we round these up or down?
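The splitting and fraction conversion described above could be sketched roughly as below. This is only an illustration under stated assumptions: the names `parse_dimensions` and `to_decimal` are hypothetical, values are assumed to be separated by " x ", and only the H/W/D letter patterns shown above are recognized. Anything else, including "Dia"/"Diameter" values, returns None so the record can be reviewed by hand. No rounding is applied, since that question is still open, and decimals keep a leading zero (0.5 rather than .5).

```python
import re
from fractions import Fraction

# Letter suffixes described in the notes: H, W, D (either case, with or without a space).
TYPE_BY_LETTER = {"h": "height", "w": "width", "d": "depth"}

PART = re.compile(
    r"""^\s*
    (?P<number>\d+\s+\d+/\d+|\d+/\d+|\d*\.\d+|\d+)   # mixed number, fraction, decimal, or integer
    \s*"?\s*                                          # optional inch mark
    (?P<letter>[HhWwDd])?                             # optional H/W/D letter
    \s*$""",
    re.VERBOSE,
)

def to_decimal(number):
    """Convert '10 1/2' or '3/8' to a decimal string; whole numbers stay whole."""
    total = sum(Fraction(piece) for piece in number.split())
    if total.denominator == 1:
        return str(total.numerator)
    return str(float(total))

def parse_dimensions(field):
    """Split a "Dimensions" value into (decimal value, measurement type) pairs.

    The type is None when a part carries no H/W/D letter. Returns None when any
    part does not fit the simple pattern, so the record can be handled by hand.
    """
    results = []
    for part in re.split(r"\s+[xX]\s+", field.strip()):
        match = PART.match(part)
        if match is None:
            return None
        letter = match.group("letter")
        kind = TYPE_BY_LETTER[letter.lower()] if letter else None
        results.append((to_decimal(match.group("number")), kind))
    return results
```

For example, `parse_dimensions("10h x 11w x 2d")` yields height, width, and depth values, while a value such as "Bowl: Diameter 5.5 x 2" is returned as None for manual handling.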
- Extent - Many objects have no extent. (Extent report).
- Some of these objects have lot numbers but not individual accession numbers. Thus, the lot will contain multiple objects with the same 7-digit accession number.
- According to Megan, many of these lots contain extent info for the entire lot, listed in the access database order, in their "Administrative remarks" fields. However, this is not a consistent pattern. See extent report on lots or groups of objects that have inventory extent info from the access database in the admin notes field here. We need to go through this data by hand and try to assign it to its proper place. Finish cleaning up this data.
- Some of these objects have no accession number (that is, their accession number is listed as "unknown"). Move data for "unknown" object extents over from the old access database.
- For many of these objects, it is obvious from the title that the extent is "1." However, these records number in the thousands.
- Registrar Status -
- This is the correct formatting for objects smaller than (x); for larger objects, these values should be rounded.
- ALL information in this field , so it would be difficult to move this information by hand. See report. All of this information could be moved to ATTRIBUTES, except for "Markings," which could be moved to the field that is currently, in OC, called "Content Remarks."
- In reporting, Phys_remarks info appears in the same cell. The attribute value appears after the attribute type. This is the complete list of attribute types:
- Format Gauge:
- ModelNumber:
- Material:
- Markings:
- Serial Number:
- Weight:
- Repeated values:
- If one of these attribute fields contains multiple values, these are recorded in the report as separated by semicolons. The fields themselves are separated from other data in the field by a single space. Example:
- Material: Tin; porcelain; wood ModelNumber: 1912
- ONLY within the attribute field "materials," there are additional cases in which values are often separated by commas. Example:
- Material: wood, metal, glass
- "Markings" often contains a comma as a part of the value, rather than as a symbol that denotes two separate values. Example:
- Markings: 'Nite Lite' Exclusively Distributed by LECO Electric Manufacturing Co., Florida, NY. Copyright by Kagran Corporation. Made in Japan.
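One way the exported cell might be split into attribute types and values is sketched below. This is a rough illustration, not a tested migration step: the function name `parse_phys_remarks` is hypothetical, the labels are taken from the list above, and the sketch assumes a label followed by a colon never occurs inside a value.

```python
import re

# Attribute labels exactly as listed in the notes above.
LABELS = ["Format Gauge", "ModelNumber", "Material", "Markings", "Serial Number", "Weight"]
LABEL_RE = re.compile(r"(" + "|".join(re.escape(label) for label in LABELS) + r"):\s*")

def parse_phys_remarks(cell):
    """Split one exported cell into {attribute type: [values]}."""
    matches = list(LABEL_RE.finditer(cell))
    parsed = {}
    for i, match in enumerate(matches):
        # A value runs from the end of its label to the start of the next label.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(cell)
        raw = cell[match.end():end].strip()
        # Semicolons separate repeated values; only "Material" also uses commas.
        separator = r"[;,]" if match.group(1) == "Material" else r";"
        values = [v.strip() for v in re.split(separator, raw) if v.strip()]
        parsed.setdefault(match.group(1), []).extend(values)
    return parsed
```

Commas in "Markings" are left alone, so a marking such as the LECO example above stays a single value.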
- Adding value to the "Condition" field:
- Sometimes data that should be in "Condition" appears in the "Artifact Needs" field. See artifact needs report.
- Where the exact phrase "Exhibitable/Needs Work" appears in "Artifact Needs," delete the phrase and change the value in "Condition" to 1.
- Where the exact phrase "Needs No Work" appears, delete the phrase and change the value in "Condition" to 0.
- Where the exact phrase "In Jeopardy/Unstable" appears, delete the phrase and change the value in "Condition" to 3.
- Where the exact phrase "Not Exhibitable/Stable" appears, delete the phrase and change the value in "Condition" to 2.
- Keep in mind that these phrases occasionally appear in the field "Artifact Needs" with other text. The other text should not be deleted. If it is impossible to delete some text and retain other text in the field, then no text should be deleted.
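The phrase-to-value rules above could be sketched as follows. The function name `extract_condition` is hypothetical, matching is case-sensitive because the notes call for exact phrases, and a value containing more than one of the phrases is left untouched as an assumed safe default.

```python
import re

# Exact phrases from the notes, mapped to the "Condition" value to set.
CONDITION_BY_PHRASE = {
    "Needs No Work": 0,
    "Exhibitable/Needs Work": 1,
    "Not Exhibitable/Stable": 2,
    "In Jeopardy/Unstable": 3,
}

def extract_condition(artifact_needs):
    """Return (condition value or None, remaining "Artifact Needs" text)."""
    found = [phrase for phrase in CONDITION_BY_PHRASE if phrase in artifact_needs]
    if len(found) != 1:
        # No phrase, or an ambiguous value: change nothing.
        return None, artifact_needs
    phrase = found[0]
    # Remove the phrase and any separator right after it; keep all other text.
    remaining = re.sub(re.escape(phrase) + r"[\s;,.]*", "", artifact_needs, count=1)
    return CONDITION_BY_PHRASE[phrase], remaining.strip().rstrip(" ;,")
```

For instance, "Exhibitable/Needs Work; needs new box" would give a condition of 1 and leave "needs new box" in "Artifact Needs."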
- Adding values for "Marking" and "Tagging"
- Where the exact phrases "Marking" and "Needs marking" appear, check "needs marking" in Artifact Needs.
- Where the exact phrases "Tagging" and "Needs tagging" appear, check "needs tagging" in Artifact Needs.
- Where the exact phrases "Marking and tagging," "Marking & tagging," "Needs marking and tagging," and "Needs marking & tagging" appear, check both "marking" and "tagging" in Artifact Needs.
- There are no other phrasings aside from those listed above, so this task can be automated.
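Because the phrasings are a closed list, the marking and tagging rules above could be sketched as a simple lookup. This is an illustration only: the function name `marking_tagging_flags` is hypothetical, and the sketch assumes each value in "Artifact Needs" is checked on its own, that capitalization varies, and that a trailing period may be present.

```python
# Listed phrasings mapped to (needs marking, needs tagging).
FLAGS_BY_PHRASE = {
    "marking": (True, False),
    "needs marking": (True, False),
    "tagging": (False, True),
    "needs tagging": (False, True),
    "marking and tagging": (True, True),
    "marking & tagging": (True, True),
    "needs marking and tagging": (True, True),
    "needs marking & tagging": (True, True),
}

def marking_tagging_flags(value):
    """Return (needs_marking, needs_tagging) for one "Artifact Needs" value."""
    # Collapse repeated whitespace, drop a trailing period, and ignore case.
    key = " ".join(value.split()).rstrip(".").lower()
    return FLAGS_BY_PHRASE.get(key, (False, False))
```

Any value that is not one of the listed phrasings returns (False, False) and is left as it is.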
- Keep remaining values in "Artifact needs" and keep the field.
- Accession date -
- The following attribute fields are available for inclusion in reports run on OBJECTS: Category, Components, Copyright date, Copyright holder, Copyright Statement, Creation date, Credit line, Dimensions (in), Display Date, Extent, Format, Manufacturer, Material, Other physical details, Photo credit, Publication date, Publisher, Serial number, Subcategory, Subject, Technique
- The following attribute fields are a part of the database structure but are NOT available in reports run on OBJECTS: Color, Color/BW, Country, Creator, Date of birth, Date of death, Director, Distributor, Form, Genre, Historical notes, Key, Label and caption, Language, Licensor, Medium, Network/Cable service, Place name, Principle cast, Production company, Production date, Release date, Running dates, Running time.
- Most of these attributes are in use in other data sets (Entities, Occurrences, etc). It is unknown whether or not any object records have data in any of these fields, but it is unlikely (the database structure probably does not allow it).
- All attribute fields are repeatable. When exporting this data via the OC "reporting" function, however, each object record has only one cell per attribute field. In cases in which a field has been repeated in an object record, both pieces of data will appear in a single cell separated by a semicolon.
- CATEGORY: Delete this attribute field after we finish the classification walkover.
- Alternate dimensions (eg: "Bowl: Diameter 5.5 x 2; Cup: 4.25 x 2.25 x 2.25," or "Box: 1 x 7.25 x 7.25")
- Clothing sizes (eg: "Adult size medium (38-40)," "Children's large.")
- See "Other physical details" objects list for the remaining items. This may need to be moved by hand.
- "By Personality/Character Name: Lombard, Carole ; Carole Lombard; Sports/Exercise/Martial Arts; Home/Family Life"
- "By Production Title: STAR MAKER, THE; Linda Ware; Louise Campbell; Bing Crosby; Ned Sparks; Walter Damrosch (Conductor/ Actor); Gus Edwards (Songwriter); Teenagers"
- "By Personality/Character Name: Bennett, Constance; Constance Bennett; Joan Bennett; Barbara Bennett; Children/Babies"
- LOANS: In order to NOT migrate the "Loans" history data, we'll have to delete only the "history" records with a type_id of 1 (Loans).
- LOCATION: Standardize format for "location."
- Mis-formatting 1:
- There are 1,827 entries that have no colon between the storage area and the location code, and instead have a space.
- Example: "MST 7:5:2"
- Replace the space between MST and the number string with a colon.
- Mis-formatting 2:
- There are nearly 200 entries that have a space between the first colon and the beginning of the numeric location code.
- Example: "MST: 7:5:2"
- Delete the space between MST: and the number string.
- Another formatting change: Notes. In many cases, there are notes listed in parentheses after the location code.
- BOU:2:2:8 (lamphouse only)
- MST:6:3:6 (.1-.3)
- MST:12:5:6 (temporary)
- In CS, all of these notes should be taken out of their parentheses and entered as a note or specification to the location code.
- Fix other formatting errors by hand. See history formatting spreadsheet.
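The two mis-formattings and the parenthetical notes described above could be handled together along these lines. This is a sketch under assumptions: the function name `normalize_location` is hypothetical, the storage area is assumed to be letters only (as in MST and BOU), and the location code is assumed to be numbers separated by colons. Values that do not fit return None so they can be fixed by hand.

```python
import re

# Storage area (letters), a colon or space separator, a colon-separated
# numeric code, then an optional note in parentheses.
LOCATION = re.compile(r"^([A-Za-z]+)(?::\s*|\s+)(\d+(?::\d+)*)\s*(?:\((.*)\))?\s*$")

def normalize_location(value):
    """Return (location code, note or None), or None if the value needs hand review."""
    match = LOCATION.match(value.strip())
    if match is None:
        return None
    area, code, note = match.groups()
    return f"{area}:{code}", note
```

Under these assumptions, "MST 7:5:2" and "MST: 7:5:2" both become "MST:7:5:2", and "BOU:2:2:8 (lamphouse only)" is split into the code "BOU:2:2:8" and the note "lamphouse only".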