/
HAVRC Data Migration ETL Work

HAVRC Data Migration ETL Work

This page contains information about data migration ETL work for HAVRC's deployment of CollectionSpace

HAVRC Authorities

Migration of HAVRC legacy data is dependent upon the ability to create and deploy a number of authority files within the CollectionSpace environment.  Nearly every table in the current HAVRC system is dependent on data from one or more of these authorities.  All authorities must be able to create both hierarchical and associative relationships between records in the same authority.  Both preferred and non-preferred terms should be searchable and display in predictive text handling lists (see /wiki/spaces/collectionspace/pages/666274159 )  Once these basic authority structures have been developed and instances are available for use, the first step in the data migration process should be the migration of existing authority data into these structures.  Since many of the values that populate the HAVRC authorities (such as personal and corporate Names, Place names, Work types, Styles, Periods, Materials and Techniques, etc.) were derived from the licensed Getty vocabularies (http://www.getty.edu/research/tools/vocabularies/index.html),  an ideal situation going forward would be the ability to access the online version of these thesauri from within the CSapce application via Web services (see http://www.getty.edu/research/tools/vocabularies/obtain/download.html) for both searching and data retrieval.  This would eliminate the need to maintain the entire thesaurus structure within the application and take the advantage of continuing product development by the Getty.

HAVRC Names

The HAVRC Names table currently contains 7289 names.  Each name is typed with one of 19 name types.  HAVRC alt_name currently contains 75 records representing alternate names, also typed, that are linked to records in the names table.  Both name and alt_name records will map to the person and organization authority.  Donor names, staff names, and patron names exist in the donor, staff and patron tables and should be mapped into the CSpace Name Authority, preferably as separate vocabularies from artist names.  Donor and staff names will have to be parsed into first and last names.

Major Issues

  • HAVRC Names table combines Name and role in one record, resulting in multiple records for the same artist name with different roles associated (e.g. Rubens as painter, designer, architect, sculptor, printmaker, etc.)  Multiples need to be identified so that the name may be migrated once, with all associated roles mapped as multiple values in the repeatable Occupation field.

Customizations required:

  • Name block needs to be a repeating group to allow for the inclusion of alternate names. (fixed in Ver. 2.4) 
  • Additional fields needed for repeating data source block for name data source and source_id (ULAN id, LCNAF id, etc.).  (fixed in Ver. 2.4)
  • Source names should ultimately be derived from the Brief Citation field of the Citation Authority.
  • CSpace birth date and death date fields need to be structured dates rather than calendar dates.
  • Additional (structured) date fields needed to record activity dates. Best approach would be a repeating activity dates group with earliest_active, latest active, and activity place. (See wireframe attached to JIRA CSPACE-4996)
  • Additional Qualifier field needed in Production Person Group in the Catalog record to hold "School of", "Circle of", "Follower of", etc.
  • May need a local field to hold HAVRC name_ids to aid migration?

  

HAVRC Table Name

HAVRC Field Name

HAVRC Name Type

CSpace Field Name

Notes

xdb_names_qv

preferred_name

personal

Display name

Preferred names with qualifiers included such as "(File with)", "(After)", or "(Attr)" need to be isolated.  These records should map the name, qualifier, and role to the catalog record of the work they are linked to as a production person (e.g. Production Person = "Rubens, Peter Paul", Production Role = "painter"  Qualifier = "After"

 

last_name

personal

Surname

 

 

middle_name

personal

Middle name

 

 

name_type_description

 

 

Determines whether the data from this record goes into the Person or Organization Authority (see name types mapping document)

 

name_abbrev

personal

 

This field was used to save real estate on slide labels.  While no longer needed, some data cleanup is required here.  Some of the values in this field are alternate names

 

qualifier

personal

Addition to name

Data cleanup required here. Many of these values (Circle of, follower of, workshop of) denote unnamed individuals whose work displays affinities with those of the named artist.

 

honorific

personal

Title

This field currently contains 26 unique values and needs some cleanup 

 

ulan_id

 

 

New field needed for names derived from ULAN.  Could be part of a source + source_detail group (Source="ULAN" and Source_detail="500002921")

 

call_num_cd

 

 

Call number code associated with analog collection.  Recommend keeping this is a local field for now.  May be useful for sorting all the attributeds, circle of, school of names.  Perhaps a local field could be added to the person and org authorities to hold this legacy data if pre-migration data cleanup is not possible.

 

birth_date

personal

Birth Date

 

 

death_date

personal

Death Date

 

 

first_active_date

personal

 

New field group needed here made up of structured date + place name

 

last_active_date

personal

 

 

 

source_abbrev

personal/organization

 

Need new source and source detail group here (See ulan_id above) (fixed in Ver. 2.4)

 

notes

personal

Name notes

 

 

culture_name

personal

Nationality

 

 

lang_name

personal/organization

 

Need new field to record language of the name (Fixed in Ver. 2.4)

xdb_name_role_cv

name_role

 

Occupations

 

sql for xdb_names_qv
create view xdb_names_qv (
name_id,
last_name,
first_name,
middle_name,
name_type_id,
name_type_descr,
name_abbrev,
birth_date,
early_birth_date,
late_birth_date,
death_date,
early_death_date,
late_death_date,
cat_term_id,
catalog_term,
call_num_cd,
first_active_date,
early_first_active,
late_first_active,
last_active_date,
early_last_active,
late_last_active,
gender,
culture_id,
culture_name,
complete_flag,
source_id,
source_type,
source_abbrev,
lang_id,
lang_name,
honorific,
qualifier,
ulan_id,
header_id,
header_text,
header_descr,
name_notes,
archive_call_num,
preferred_name,
last_chng_staff_id,
last_chng_staff_login,
last_chng_date
)
as
select t1.name_id, t1.last_name, t1.first_name, t1.middle_name,
       t2.name_type_id, t2.name_type_descr, t1.name_abbrev,
       t1.birth_date, t1.early_birth_date, t1.late_birth_date,
       t1.death_date, t1.early_death_date, t1.late_death_date,
       t6.cat_term_id, t6.catalog_term, t1.call_num_cd,
       t1.first_active_date, t1.early_first_active, t1.late_first_active,
       t1.last_active_date, t1.early_last_active, t1.late_last_acti
       t1.gender, t3.culture_id, t3.culture_name, t1.complete_flag,
       t4.source_id, t4.source_type, t4.source_abbrev,
       t5.lang_id, t5.lang_name, t1.honorific, t1.qualifier,
       t1.ulan_id,  t7.header_id, t7.header_text,
       t7.header_descr, t1.name_notes, t1.archive_call_num,
       t1.preferred_name, t1.last_chng_staff_id,
       t9.staff_login, t1.last_chng_date

from
       names t1, name_type t2, culture t3, source t4, language t5,
       mip_common..cataloging_terms t6, header t7,
       staff t9
where
       t1.name_type_id = t2.name_type_id and
       t1.culture_id = t3.culture_id and
       t1.source_id = t4.source_id and
       t1.lang_id = t5.lang_id and
       t1.cat_term_id = t6.cat_term_id and
       t1.header_id = t7.header_id and
       t1.last_chng_staff_id = t9.staff_id
sql for xdb_name_role_cv
create view xdb_name_role_cv 
(
  name_id,
  name_role
)
as
select
     t1.name_id,
     t1.first_name
     + substring(' ', isnull(sign(ascii(t1.first_name)),0), 1)
     + t1.middle_name
     + substring(' ', isnull(sign(ascii(t1.middle_name)),0), 1)
     + t1.last_name + ', ' + t1.qualifier
     + substring(', ', isnull(sign(ascii(t1.qualifier)),0), 2)
     + t2.catalog_term
from
     names t1, mip_common..cataloging_terms t2
where
    t1.cat_term_id = t2.cat_term_id 
HAVRC Place Names

Place names in HAVRC are largely derived from the Getty Thesaurus of Geographic Names (TGN) and are stored in the Locations table and in several views (www_locations, xdb_loc_qvnc, xdb_loc_acv, xdb_loc_qvoj).  Unlike the AAT and ULAN vocabularies (which was loaded into Sybase tables and made available within the HAVRC application), the values derived from the TGN are cut and pasted by cataloging staff from the public website maintained by the Getty Vocabulary Program (http://www.getty.edu/research/tools/vocabularies/tgn/) into the Locations table.  Hierarchical relations are created one by one by cataloging staff.    Because all the Getty vocabularies are continually growing and changing, and data input errors are inevitable, some of this place name data may be incorrect or out of date.  Unlike the AAT and ULAN data, the TGN data has never been refreshed or updated annually from Getty-provided materials.  Therefore, it may be advisable to run a verification check against the most current TGN data to identify and correct data errors before importing into CSpace.

A certain amount of data massage may be necessary on import to create consistent qualified names and display names out of HAVRC legacy data.  For example, homonyms could have qualified names that include the parent name appended in parentheses, e.g. "Cambridge (Massachusetts)" vs "Cambridge (England)".  Display names should have the qualified name plus name type, e.g. "Cambridge (Massachusetts, inhabited place)".  Display Names for place names that have no homonyms and no qualifiers could be comprised of place name + place name type, e.g. "India (nation)".  The place name qualifiers are necessary for disambiguation when place names are presented in a term completion picklist with no other information.  The display names are likely to get long and unwieldy unless abbreviations are used for the parent terms, e.g. "Cambridge (MA, inhabited place)", but these abbreviations have not been entered consistently and many would have to be added or created on export.

Data Cleanup Needed

  • Isolation of HAVRC location terms that already include the qualifier in the location term string (e.g. "Castello (sestiere)" ) from those that include alternate names (e.g. "Abu Sunbul (Abu Simbel)").  Currently there are 205 place name records that include something in parentheses following the place name.
  •  Parsing of HAVRC location terms that include alternate place names in the string, e.g. "Abu Sunbul (Abu Simbel)" to create a preferred name and an alternate name.

HAVRC Table Name

HAVRC Field Name

CSpace Information Group

CSpace Field Name

Notes

Locations

loc_id

Name Authority Body

Identifier

 

www_locations

loc_term + loc_type_descr

Place names

Qualified name

loc_term + "(" + loc_type_descr + ")" - Example Cambridge (Massachusetts)

www_locations

loc_term + loc_type_descr

Name Authority Body

Display_name

Display name should be comprised of Qualified name + name type.  Some place names have alternate names (enclosed in parentheses) included in the loc_term field.  These need to be parsed and the primary place name entered into the preferred name field and alternate place name entered into a repeating non-preferred name element.

www_locations

 

Place names

Name

Place names without alternate names or qualifiers

www_locations

loc_type_descr

Place names

Name type

 

Locations

loc_abbrev

Place names

Place Name Abbreviation

Not all location have abbreviations

Locations

term_type

Place names

Status

Controlled list (c=Current, h=Historical, b=Both)

xdb_loc_qvnc

parent_term

Hierarchy

Broader Context

Hierarchy links made after all location terms have been migrated

Location

latitude

Locality Information, Georeference Information

Verbatim latitude, Decimal latitude

 

Location

longitude

Locality Information,
Georeference Information

Verbatim longitude, Decimal longitude

 

Location

source_abbrev

Place Authority body, Place names

Source, Source

(In repeating term block in Ver. 2.4)

 

tgn_id

Place names

Source ref_id

Only for records with TGN in source_abbrev (In repeating term block in Ver. 2.4)

 

loc_notes

Place Authority body

Note

 

 

child_term

Hierarchy

Narrower context

 

 

 

 

Coordinate system

All records should have "Latitude and Longitude" value

Location

Latitude + longitude

Locality information

Verbatim coordinates

Concatenate latitude +" X " + longitude

sql for xdb_loc_qvnc
create view xdb_loc_qvnc
(
loc_id,
loc_term,
loc_type_id,
loc_type_descr,
loc_abbrev,
loc_hier,
parent_id,
parent_term,
par_loc_type_id,
par_loc_type,
term_type,
latitude,
longitude,
source_id,
source_type,
source_abbrev,
tgn_id,
staff_id,
staff_login,
date_entered,
last_chng_staff_id,
last_chng_staff_login,
last_chng_date,
loc_notes,
child_term,
child_id
)
as
select t1.loc_id, t1.loc_term, t2.loc_type_id, t2.loc_type_descr,
       t1.loc_abbrev, t1.loc_hier, t1.parent_id, t3.loc_term,
       t5.loc_type_id, t5.loc_type_descr,
       t1.term_type, t1.latitude, t1.longitude,
       t4.source_id, t4.source_type, t1.tgn_id,
       t4.source_abbrev, t1.staff_id, t7.staff_login,
       t1.date_entered, t1.last_chng_staff_id,
       t6.staff_login, t1.last_chng_date, t1.loc_notes,'', 0
from
       location t1, loc_type t2, location t3, source t4,
       loc_type t5, staff t6, staff t7
where
       t3.loc_id = t1.parent_id
       and t2.loc_type_id = t1.loc_type_id
       and t4.source_id = t1.sou

        rce_id
       and t3.loc_type_id = t5.loc_type_id
       and t1.staff_id = t7.staff_id
       and t1.last_chng_staff_id = t6.staff_id 
Citations

Because the HAVRC collection consists of images of objects, rather than the objects themselves, it is important for HAVRC staff to document the sources from which data about the object(s) depicted in the images are derived.  Data is gathered from a variety of (primarily) secondary sources, and nearly every table in the HAVRC application has two fields to reference these sources--Source abbreviation and source type.  While most cited sources are print publications (books, journals, exhibition catalogs and ephemera), data is also gathered from a variety of other sources, including websites, documentation provided by image vendors, information found on analog slide and photo labels, faculty input (oral history), etc.  Details about HAVRC sources are recorded in one of 7 source type tables: Books, Journals, Vendor Images, Donors/Duplicates, Electronic, Exhibition, and Other Sources.  Each table has a different field structure to hold data relevant to that source type.  Data in these tables will all map to a common structure in the CSpace Citations Authority.  The only field they all share in common is a field called source_abbreviation.  It is this source_abbreviation field + the source type that links the information about the source to the data recorded in other tables.  

Initial migration might simply map the source_abbreviation and source_id into the Brief Citation and Local Source Id fields in the Citation Authority, with the rest to follow based on the source_id.  Full citations are created for the nightly snapshot of data for the HAVRC website and may be migrated from www_source_citations to the CollectionSpace Source Full Citation field for all legacy sources.  Once the legacy source data is migrated, these full citations will have to be either entered by hand, or copied and pasted from a citation string generated from an online service such as WorldCat or Refworks for all new citation authority records.  Data in the 7 source type tables should be mapped as follows to the UCB Proposed Citations Authority Schema.

Customizations needed:

  • Additional field to hold legacy source_id to aid migration?
  •  

    HAVRC Table Name

    HAVRC Field Name

    CSpace Information Group

    CSpace Field Name

    Notes

    Source

     

    Citation Autuhority

     

     

     

    source_abbrev

    Citation Information

    Brief Citation

     

     

    source_id

    Citation Information

    Local Source Id (Custom field)

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    source_type

    Citation Information

    Source Type

    b=book, j=journal, v=vendor, d=donor/duplicate, m=electronic media, e=exhibition, o=other.  Note: other suggested source types are more granular but these values will speed the initial data migration.

    www_source_citations

    cite

    Citation Information

    Source Full Citation

    The cite field in this table contains a fully formatted citation for each accession (including page and plate numbers).  SQL code used to populate this field for the website data (format_cdwa_source?) could be edited to populate this field with legacy data (i.e. edit the part of the code that adds page and plate numbers and put that page&plate number data in the source detail field of the media handling record for that accession.  Full citation for new citation records added post-migration will have to be constructed manually for this field or generated and pasted from an external application such as Refworks.

    Book

     

    Citation Authority

     

    For all those source records where source_type=b

     

    source_id

    Citation Information

    Local Source Id (Custom field)

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    author

    Person/Org Authority

    Display Name

    Many HAVRC book records contain multiple authors in the author field.  This data will need to be parsed and imported as separate names and entered into the Person Authority, or separated manually post-migration.

     

    book_title

    Title and Issue Information

    Source Title

    447 HAVRC book titles contain subtitle or section title info.  Data will either need to be parsed and imported separately, or done so manually post-migration.

     

    series_title

    Title and Issue Information

    Source Title

    Create new Citation record for the series title with Source Type="Series". Link as broader context in Hierarchical Relationships to record where series title appears in HAVRC series_title field.

     

    edition

    Title and Issue Information

    Source Edition

     

     

    publish_location

    Publisher Information

    Source Publication Place

     

     

    publisher_name

    Publisher Information

    Source Publisher

     

     

    publish_year

    Title and Issue Information

    Source Publication Year

     

     

    book_call_num

    Resource Identifiers

    Source Other Identifier

    Map with "Call number" in Source Other Identifier Type

     

    book_volume

    Title and Issue Information

    Source Volume

     

     

    book_abbrev

     

     

    From Source table (see above)?

     

    book_comments

    Publisher Information

    Remarks

     

     

    ISBN

    Resource Identifers

    Source Other Identifier

    Map with "ISBN" in Source Other Identifier Type

     

    date_entered

    Citation Information

    Source Data Capture Date

     

    Journal

     

    Citation Authority

     

    For all those source records where source_type=j

     

    source_id

    Citation Information

    Local Source Id (Custom field) 

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    journal_title

    Title and Issue Information

    Source Title

     

     

    journal_volume

    Title and Issue Information

    Source Volume

     

     

    journal_issue

    Title and Issue Information

    Source Issue

     

     

    journal_year

    Title and Issue Information

    Source Publication Year

     

     

    journal_month

    Title and Issue Information

     

    No real place for this in new schema unless added to Source Issue. Example: For Art Bulletin volume LXV issue 3, published in September of 1983:  in Source Issue put "3 (SEP)"

     

    journal_abbrev

    Citation Information

    Brief Citation

     

     

    ISSN

    Resource Identifers

    Source Other Identifier

    Map with "ISSN" in Source Other Identifier Type

     

    journal_call_num

    Resource Identifiers

    Source Other Identifier

    Map with "Local Call Number" in Source Other Identifier Type

     

    journal_comment

    Publisher Information

    Source Remarks

     

     

    date_entered

    Citation Information

    Source Data Capture Date

     

    Vendor Images

     

    Citation Authority

     

    For all those source records where source_type=v. Vendor images purchased both individually and in sets.  Data from Vendor table maps to Org Authority.  Data from vendor_set (or xdb_vendor_set_qv) to Citation Authority

    xdb_vendor_set_qv

    source_id

    Citation Information

    Local Source Id (Custom field)

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    vendor_name

    Publisher Information

    Source Publisher

    Vendor name maps to either Person or Org Authority.  Will need custom map for legacy data

     

    vendor_set_abbrev

    Citation Information

    Brief Citation

    If set_flag is "No" map only vendor_order_abbrev to Brief Citation. 

     

    vendor_item_id

    Title and Issue Information

    Source Title

    If set_flag is "yes" map vendor_item_id to Title and Issue Information - Source Title

     

    set_comments

    Publisher Information

    Source Remarks

     

     

    set_flag

    Citation Information

    Vendor Set (local boolean field?)

    Determines whether the citation refers to an individual image or a set of images.  Use for migration of legacy data only or put in custom local field "Vendor Set"

     

    date_entered

    Citation Information

    Source Data Capture Date

     

    Donors/Duplicates

     

    Citation authority

     

    For all those source records where source_type=d.  Most donor names map to Person Authority, a few to Org Authority.  Need to cull duplicate names entered with different donation types (e.g. same name entered as source for donated slide, duplicate slide, and oral history, etc.)

    donor

    source_id

     

    Citation Information

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    Local Source Id (Custom field) 

     

     

    donor_name

    Person Authority

    Person Display Name; Forename; Surname

    Person and Org names will need to be separated manually.  After dups are culled, map donor_name to Person displayName, parse last name and first name at comma and place in Forename and Surname fields of Person Authority.  For orgs, map donor_name to Organization displayName

     

    date_entered

    Citation Information

    Source Data Capture Date

     

    xdb_donor_qv

    credit_line

    Media Handling

    Copyright Statement

    Any data in credit line maps to media handling records for the accessions that point to this source_id as image source.

     

    copyright_owner

    Media Handling

    Rights Holder

    If copyright_owner=1 ("yes") then map name to Rights Holder field of Media Handling record for each accession linked to this source_id as image source.

     

    donor_comments

    Publisher Information

    Source Remarks

     

    Electronic Resources

     

    Citation Authority

     

    For all those source records where source_type="m"

    xdb_electronic_media_qv

    source_id

    Citation Information

    Local Source Id (Custom field) 

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    electronic_media_abbrev

    Citation Information

    Source Brief Citation

     

     

    electronic_media_title

    Title and Issue Information

    Source Title 

     

     

    publisher_name

    Publisher Information

    Source publisher

     

     

    publish_location

    Publisher Information

    Source Publication Place

     

     

    publish_year

    Title and Issue Information

    Source Publication Year

     

     

    url

    Resource identifiers

    Source Other identifier

     

     

    medium

    Citation Information

    Source type

     

     

    data_format

     

     

    add to comments?

     

    series_title

     

     

    no data

     

    media_count

     

     

    only 1 record contains a value - add to comments

     

    media_call_num

     

     

    no data

     

    system_requirements

     

     

    add to comments?

     

    media_size

     

     

    no data

     

    electronic_media_comments

    Citation Information

    Source Remarks

     

     

    date_entered

    Citation Information

    Source Data Capture Date

     

    Exhibition

     

    Citation Authority

     

    For all those source records where source_type=e.  Most data in this table will map to Exhibition Authority

    www_source_citations

    source_id

    Citation Information

    Local Source Id (Custom field)  

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

    xdb_exhibit_qv

    exhibit_abbrev

    Citation Information

    Source Brief Citation

     

    xdb_exhibit_qv

    exhibit_title

    Citation Information

    Source Title

     

    xdb_exhibit_qv

    exhibit_item_type

    Citation Information

    Source Type

     

     

    date_entered

    Citation Information

    Source Data Capture Date

     

    Other Sources

     

    Citation Authority

     

    For all those source records where source_type=o

    other_sources

    source_id

    Citation Information

    Local Source Id (Custom field)   

    Either create a custom field or map to Other Resources-> Source Other Identifier with a Source Other Identifier type="legacy_source_id"

     

    other_abbrev

    Citation Information

    Source Brief Citation 

     

     

    other_descr

    Citation Information

    Source Remarks

     

    xdb_other_sources_qv

    pre_seurat_order_num

    Resource Identifiers

    Source Other Identifier

     

     

    format_descr

    Resource identifiers

    Source Other Identifier Type

     

     

    date_entered

    Resource Identifier

    Source Identifier Capture Date

     

Cultures
Materials
Processes and Techniques
Styles and Periods
Subjects

HAVRC Orders Data Mapping to CSpace Intake

HAVRC Works Data Mapping to CSpace Cataloging

Simple vs Complex Works

Works of art that consist of a single object are expressed in HAVRC as a single record in the works table, to which one or more names, dates, places, materials, techniques, subjects, and accession records may be linked via the work_id created by the system when a new work record is created. Works of art that consist of one or more component parts are expressed in HAVRC by links between the works and work_component tables. A sequential component_id is created by the system whenever a component record is saved and linked to the parent work via the work_id recorded in the work_component table. True component works cannot be completely understood without inheriting data from the parent work and this is what distinguishes these part-whole relationships from other associative relationships. For these more complex works, one or more names, dates, places, materials, techniques, subjects and accessions may be linked to each component work as well as to the parent work.

There are currently 1082 work records that have a sum total of 4058 linked component records. Until CollectionSpace has implemented the ability to create relationships between collection objects, HAVRC will need to employ a workaround to capture the relationships between parent works and their component parts as PAHMA has done (see PAHMA Collection Objects Data Mapping - Object-Object Relationships).  Additional local fields will likely be needed in order to prevent data loss and/or corruption during the initial migration and additional functionality will be needed going forward to preserve and enable data inheritance between works and their components.

HAVRC Accessions Data Mapping to CSpace Media Handling