With this iteration, we are targeting the data migration mappings for the migration of SMaSCH data into a CollectionSpace instance, based on the version 2.3 release of CollectionSpace. Although there will be additional changes supported in the future, this is the major migration and launch release.
See attachment for accession mapping. The grid below is not complete. |
SMaSCH |
CollectionSpace object schema |
CS User Interface Title (** = different from UI) |
Note |
accession.accession_id |
object_number |
Accession Number ** |
Change field title |
institution.inst_name |
collection |
Collection |
Dropdown (Collection values) |
accession.coll_num_prefix + coll_number + coll_num_suffix |
other_number |
Number |
Moved to Field Collection Number. |
"collector number" |
other_number_type |
Number Type |
Not needed if we move this to Field Collection Number. |
accession.notes |
comments |
Comments |
|
objkind.kind |
form (Object Description Information) |
Form |
Dropdown (Form values) |
taxon_fullname.fullname |
title |
Taxon ** |
Change field title |
accession.phenology (decoded) |
phase (Object Description Information) |
Phenology ** |
Change field title. Dropdown (Phenology values) |
accession.early_jdate (calc) |
field_collection_date_earliest |
Field Collection Date Earliest |
Custom |
accession.late_jdate (calc) |
field_collection_date_latest |
Field Collection Date Latest |
Custom |
accession.datestring |
field_collection_date |
Field Collection Date |
We are converting this to the structured date format |
committee.committee_abbr |
field_collection_collector |
Field Collection Collector |
|
accession.loc_lat_decimal |
field_loc_lat_decimal |
Field Location Latitude Decimal |
|
accession.loc_long_decimal |
field_loc_long_decimal |
Field Location Longitude Decimal |
|
accession.loc_place (calc) |
field_collection_place (?) |
Field Location Verbatim** |
Will map to verbatim field in schema extension for now. When the Place Authority is available, we will work on creating authority records as well. |
accession.loc_county |
field_loc_county |
Field Location County |
Custom |
accession.loc_state |
field_loc_state |
Field Location State |
Custom |
accession.loc_country |
field_loc_country (temp custom) |
Field Location Country |
Custom |
accession.loc_elevation |
field_loc_elevation (temp custom) |
Field Location Country |
Custom |
accession.coll_num_prefix + coll_number + coll_num_suffix |
? |
Field collection number |
|
committee.committee_abbr |
? |
Field collection collector |
|
"catalog date" |
date_association |
Date Association |
Fixed text. Do not load into 1.0. May end up in Associated Date information eventually, or create date/modify date. |
accession.catalog_date |
catalog_date |
Date Text |
Fixed text. Do not load into 1.0. May end up in Associated Date information eventually, or create date/modify date. |
Using three views to get subsets of the records, identical in their selects, except for the beginning of their accession number:
These views make the minimal joins needed since refnames (collectors, collection number assignors, filed_in_cabinet genus) will be looked up using text files coming from earlier Talend jobs.
See these links
Combines the name as filed and the annotation_history table from SMASCH.
See the determinations worksheet in the attached accessions mapping document
With some dates fixed, all determinations are now in cspace_determ_2_3.
Variant determinations: vname_id field recalculated to be "unverified"||vname_id in order to avoid collisions with taxon_id
Talend job MERGE_COLLOBJ_DETERMINATIONS (0.2)
Run job three times to modify SQL on view to get subsets: UC, JEPS, OTHER. These will map against Collection Object subsets. |
JEPS:
where object_number like 'JEPS%'
order by object_number, legacy_id
UC:
where substring(object_number, 1, patindex('%[A-Z][0-9]%', object_number)) = 'UC'
order by object_number, legacy_id
OTHER:
where substring(object_number, 1, patindex('%[A-Z][0-9]%', object_number)) in ('UCSC', 'UCLA', 'UCSB', 'DHN')
order by object_number, legacy_id
Inputs
View modified to remove time from aa.determined_date.
See the type kind worksheet in the attached accessions mapping document
Using this query with view
SELECT legacy_id spectype_id, rtrim(object_number) accession_id, type_kind, REFERENCE, figure, determined_by_id, determined_by, notes, basionym_id, verified_basionym FROM dbo.cspace_spectype_2_3 where object_number <> '' order by object_number, legacy_id desc |
create view cspace_spectype_2_3 as
select
ti.typeinfo_id as legacy_id,
rtrim(ti.accession_id) as object_number,
tk.kind as type_kind,
case
when ti.place_pub is not null and ti.place_pub != ''
and ti.year_pub is not null and ti.year_pub not in ('', '0', 'unkn')
then ti.place_pub + ', ' + ti.year_pub
when (ti.place_pub is null or ti.place_pub = '')
and ti.year_pub is not null and ti.year_pub not in ('', '0', 'unkn')
then year_pub
else null
end as reference,
ti.figure as figure,
ti.typ_committee_id as determined_by_id,
c.committee_abbr as determined_by,
case
when ti.notes is null and v.descr is not null
then 'Type Voucher: ' + v.descr
when ti.notes is null and v.descr is null
then null
when ti.notes is not null and v.descr is not null
then ti.notes + '; Type Voucher: ' + v.descr
when ti.notes is not null and v.descr is null
then ti.notes
end as notes,
ti.basionym_id as basionym_id,
tf.fullname as verified_basionym
from
types_info ti,
typekind tk,
committee c,
voucher v,
taxon_fullname tf
where ti.typekind_id *= tk.typekind_id
and ti.typ_committee_id *= c.committee_id
and ti.accession_id *= v.accession_id
and v.vouchkind_id = 56
and ti.basionym_id *= tf.taxon_id
Talend job MERGE_COLLOBJ_SPECTYPE
Note that in SMaSCH, a table named voucher was used to capture text notes of various kinds. Many of these are being cleaned up and put in other places, but some will remain.
See vouchers worksheet in attached accession mapping document
SELECT object_number, annotation_type, annotation_note, annotation_date FROM dbo.cspace_anno_2_3 ORDER by object_number, annotation_date |
create view cspace_anno_2_3 as select rtrim(v.accession_id) as object_number, vk.kind as annotation_type, v.descr as annotation_note, v.voucher_date as annotation_date from voucher v, vouchkind vk where v.vouchkind_id *= vk.vouchkind_id and v.vouchkind_id in (24, 29, 41, 56, 73, 78, 79) union select rtrim(ti.accession_id) as object_number, 'holotype location' as annotation_type, ti.inst_abbr as annotation_note, ti.mod_date as annotation_date from types_info ti where ti.inst_abbr is not null and ti.inst_abbr != '' union select aa.accession_id as object_number, 'additional taxa' as annotation_type, tf.fullname as annotation_note, aa.folder_name_date as annotation_date from annotation_asfiled aa, taxon_fullname tf where aa.addl_taxon_id *= tf.taxon_id and addl_taxon_id is not null and addl_taxon_id != 0 |
Using the SQL above directly for now. Rtrim the accession_id.
Talend job MERGE_COLLOBJ_ANNOTATIONS
From annotation_asfiled table
See annotations tab in attached accessions mapping document
SELECT annotation_asfiled.accession_id, annotation_asfiled.parent_name_id, (case when parent_name_modifier_kind.taxon_modifier=' ' then null else parent_name_modifier_kind.taxon_modifier end) as parent_modifier FROM annotation_asfiled, parent_name_modifier_kind where annotation_asfiled.parent_name_modifier_id = parent_name_modifier_kind.taxon_modifier_id and parent_name_id <> 0 order by accession_id, parent_seqno |
Talend job MERGE_COLLOBJ_HYBRIDPARENTS
From accession and voucher tables
See voucher worksheet in attached accession mapping document.
SELECT legacy_id, other_number, other_number_type FROM dbo.cspace_othernum_2_3 ORDER by legacy_id |
create view cspace_othernum_2_3 as select rtrim(v.accession_id) as legacy_id, v.descr as other_number, case when v.vouchkind_id = 45 then 'GenBank Code' when v.vouchkind_id = 55 then 'Other Label Number' when v.vouchkind_id = 71 then 'U.C. Botanical Garden Accession Number' end as other_number_type from voucher v where v.vouchkind_id in (45, 55, 71) |
Talend job MERGE_COLLOBJ_OTHERNUM
45 = 'genbank code', 55 = 'other label numbers', 71 = 'U.C. Botanical Garden'
Use SQL to update collectionspace_core:uri for existing collection objects.
There are approximately 99 collection objects in ucjeps.cspace already. All are UC. Check to see if any have been modified in CSpace and update as needed.
UCJEPS needs approximately 12 new values in a dynamic controlled list that is used by both loans in and loans out. These can be loaded via POST commands and should be ready to go.
See attachment for field mapping. This grid is outdated. |
SMaSCH |
CollectionSpace loans out schema |
CS User Interface Title (** = different from UI) |
Note |
loan_event.uc_loan_num + loan_event.jeps_loan_num |
Loan out number |
|
smasch has two types of loan identifiers. |
loan_event.curr_inst |
Borrower |
|
Probably should keep this as borrowing institution (organization authority) |
loan_event: |
Borrower's contact |
|
Probably should keep this as borrowing agent (person authority) |
loan_event.inhouse_notes + loan_event.noteworthy_inclusions |
Loan out note |
|
|
loan_event.loan_status + |
Loaned object status |
|
It looks like this field is not currently active. smasch keeps track of two status types: |
Loan date? |
|
Loan Out Date |
Does SMaSCH have a loan date? |
view:
From loan_accn_ids table
From several fields in loan_event table
From return_detail table
From transfer_event table and possibly from loan_event table (orig_inst, orig_for_lname, orig_for_fname, orig_for_position)
These will be added as detail items. Relationship records might not be needed.
Relationship records need to be created between separate loan out records to handle UC and JEPS loans that have the same loan_id.
See attachment for field mapping. |
From several fields in borrow table
From borrow_returns table
From several fields in borrow_xfer_out table
There are 20 records in borrow_return whose borrow_id does not match with one in the main borrow table. They look like they could match on lender_loan_num and institution. It will probably be easiest to update these by hand.
There are3 records in borrow_xfer_out whose borrow_id does not match with one in the main borrow table. They look like they could match on lender_loan_num and institution. It will probably be easiest to update these by hand.
Use annotation_asfiled.collectionkind_id (Name of Collection, e.g., Main, Reference, Type) to indicate storage of specimen
See agents worksheet in authorities mapping file for current field mapping. |
View: cspace_person_2_3.
Note: View modified to exclude the persons from loans and borrows. Those are now handled separately.
Talend job GET_PERSON_ADDS adds new person records.
Updates to be found later.
Obsolete 5/18: Talend job diff_person3 compares SMaSCH view (cspace_persons_2_0) to ucjeps.cspace (persons_common) on legacy_id, displayname, bio_notes, birthdate, and deathdate. May 12: Found 3 new records and 2 updates. This can also be confirmed by viewing person table in SMaSCH in table order, noting the records added or updated in reverse chronological order.
Talend job GET_PERSON_V1 (Hoffman) was used to load data initially:
Status: SMaSCH persons records already loaded, plus persons derived from borrow and loan tables. However, these might have been updated: Person records might be updated, and borrow and loan records might be added or changed. Person records last queried from SMaSCH on 3/8/2012 at 9:32.
Creates URI
New view persons_2_3 includes people from borrows, loans, transfers, etc, so new approach required.
Excluding records where forename and surname are both null.
Too many non-unique names picked up by the select distinct statements for the trans_person table. Need to fix.
Those records are static from transaction tables, so they will have to be refreshed when it's time to switchover!
ADD records can be identifying by comparing legacy_id in SMaSCH view to person_common.shortidentifier in CSpace.
UPDATES can be found two ways: 1) using my person_diff approach AFTER new records have been added or 2) doing a table scan on smasch.person and looking in database order.
ucjepspersonids-all.txt is used by many other jobs. Create it by concatenating the ID's output from original load (in CRH archive2 directory) with new adds from this batch.
Existing person records need collectionspace_core:uri. REST update does not work on the collectionspace_core schema, so this has to be done in SQL.
See if any are used already. Can use short identifiers in a certain range.
In SMaSCH, there are several unstructured person fields. These are being parsed and then loaded as unique person authority records.
From SMaSCH committee table where committee_func= "coll"
See attachment (agents worksheet in authority mapping file) for current field mapping |
Use view: cspace_org_2_0 but limit to func="coll"; link to committee table to get collection number assignor (links back into committee where func=coll) and data source and to limit the query to get collection committee records added since last load.
Talend job GET_COLLECTOR_ADDS, just finds adds.
Talend job GET_COLLECTOR_UPDATES finds 10 records to update (by hand presumably, output format is for Import Service do not run).
Talend job GET_ORG_V2 (Hoffman) was used to load initially:
Talend job (Cheng):
Need to update existing collector committees also with collection number assignor
Creating URI in ADD and UPDATE
Existing organization records need collectionspace_core:uri. REST update does not work on the collectionspace_core schema, so this has to be done in SQL.
Update existing collectors with their collection number assignor from SMaSCH. REST api.
From SMaSCH committee table where committee_func= "nomen".
See notes on Organization-Collectors above.
Talend job GET_ORG_NOMEN, gets all nomenclature committees and adds to new org-nomenclature vocabulary
Creates URI
Need to remove existing nomenclature records from default organization vocabulary and update refnames as needed.
From SMaSCH committee table where committee_func= "annot".
See notes on Organization-Collectors above.
select legacy_id, display_name, long_name, short_name, foundation_place, contact_name, org_function, substring(contact_name, (patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6), ((patindex('%)%', dbo.cspace_org_2_0.contact_name))-(patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6))) contact_id, coll_num_person_id, data_src_name from dbo.cspace_org_2_0, dbo.committee, data_source where legacy_id=convert(varchar, committee_id) and dbo.committee.data_src_id=data_source.data_src_id and committee_func = 'annot |
Talend job GET_ORG_DETERMINATION gets all determination committees for adding to new org-determination vocabulary
Creating URI
Need to remove existing determination records from default organization vocabulary and update refnames as needed.
From SMaSCH committee table where committee_func= "typ".
See notes on Organization-Collectors above.
select legacy_id, display_name, long_name, short_name, foundation_place, contact_name, org_function, substring(contact_name, (patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6), ((patindex('%)%', dbo.cspace_org_2_0.contact_name))-(patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6))) contact_id, coll_num_person_id, data_src_name from dbo.cspace_org_2_0, dbo.committee, data_source where legacy_id=convert(varchar, committee_id) and dbo.committee.data_src_id=data_source.data_src_id and committee_func = 'typ |
Talend job GET_ORG_TYPEASSERTION gets all committees of function "typ" for loading into new org-typeassertion vocabulary
Creates URI
Need to remove existing type assertion records from default organization vocabulary and update refnames as needed.
Remove nomen, determ, type, and inst values loaded previously.
See the agents worksheet in the authorities mapping attachment
Use the institute table
Access the institute table directly, including the contact schema fields.
Might need to transform the short identifiers and refnames for institutes whose abbreviation starts with an asterisk. No, they seem to act OK.
SELECT dbo.institute.inst_abbr, dbo.institute.inst_name, dbo.institute.inst_address1, dbo.institute.inst_address2, dbo.institute.inst_city, dbo.institute.inst_state, dbo.institute.inst_zipcode, dbo.institute.inst_country, dbo.institute.location, dbo.institute.exch_bal_total, dbo.institute.exch_bal_vasc, dbo.institute.exch_bal_myc, dbo.institute.exch_bal_phyc, dbo.institute.exch_bal_bry, dbo.institute.exch_bal_types, dbo.institute.exch_bal_correction, dbo.institute.areas_they_want, dbo.institute.taxa_they_want, dbo.institute.areas_they_have, dbo.institute.taxa_they_have, dbo.institute.swap_for_lit, dbo.institute.avg_desid, dbo.institute.avg_specs, dbo.institute.phone, dbo.institute.FAX, dbo.institute.URL, dbo.institute.email FROM dbo.institute |
Talend job GET_ORG_INSTITUTION creates XML outputs for org-institution vocabulary and contacts schema.
Creating URI in Institution and Contact
Need to remove existing institution records from default organization vocabulary and update refnames as needed.
There are approximately 12 institutions in various loans and borrows tables whose abbreviation does not match any in the institute table.
Note: Load before taxon-taxon vocabulary
Use the common_name table.
Create collectionspace_core:URI
See the taxonomy worksheet in the attached authorities mapping attachment
SELECT common_name_id, name common_name, dbo.common_name.data_src_id, dbo.data_source.data_src_name FROM dbo.common_name, dbo.data_source where dbo.common_name.data_src_id = dbo.data_source.data_src_id order by common_name_id |
Talend job GET_TAXONOMY_COMMON
Controlled list values for the Name Source field on taxonomy vocabularies (taxon-taxon, taxon-common, and taxon-unverified) are documented at (they are identical to the display value):
NOTE: Need to load the taxon-common vocabulary first!
The taxon_name table is the primary source for this information. Some information coming from taxon_fullname and taxon_noauth_name. Note the repeating information for Taxonomy Authors and Common Names described below.
Note: Check to see if inAuth value is identical on ucjeps2 and ucjeps.cspace. They are: 87036424-e55f-4e39-bd12
See the taxonomy worksheet in the authorities mapping attachment
Using the view cspace_taxonomy_2_3.
Talend job GET_TAXONOMY_V1 (0.2)
Input: basionym_lookup.txt (replace with all_taxon?)
Basionym values handled (skipped if 0 or equal to taxon_id).
Controlled list values for the Name Source field on taxonomy vocabularies (taxon-taxon, taxon-common, and taxon-unverified) are documented in
Existing taxonomy records need collectionspace_core:uri. REST update does not work on the collectionspace_core schema, so this has to be done in SQL.
9292 records have a valid basionym
See the taxonomy worksheet in the authorities mapping attachment
Use the smasch_common_names table.
Get Common Name refnames from the taxon-common output
SELECT taxon_id, common_name_id FROM dbo.smasch_common_names order by taxon_id |
Talend job MERGE_TAXON_COMMON
See the taxonomy worksheet in the authorities mapping attachment.
From taxon_name table (or a new view)
Get author refnames from the org-nomenclature output
SELECT taxon_id, author_id, 'author' author_type FROM dbo.taxon_name where author_id > 0 union SELECT taxon_id, ascr_auth_id, 'ascribed author' author_type FROM dbo.taxon_name where ascr_auth_id > 0 union SELECT taxon_id, p_author_id, 'parenthetical author' author_type FROM dbo.taxon_name where p_author_id > 0 union SELECT taxon_id, p_ascr_auth_id, 'parenthetical ascribed author' author_type FROM dbo.taxon_name where p_ascr_auth_id > 0 order by taxon_id, author_type |
Talend job MERGE_TAXON_AUTHORS
See taxonomy worksheet in authorities mapping file.
Use taxon_name fields (taxon_id and parent_id) to create hierarchical relationship records. Get their refnames from the taxon-taxon output.
See the taxonomy worksheet in the authorities mapping attachment
Use variant_taxon_name table.
SELECT vname_id, variant_name, dbo.variant_taxon_name.data_src_id, dbo.data_source.data_src_name FROM dbo.variant_taxon_name, dbo.data_source where dbo.variant_taxon_name.data_src_id = dbo.data_source.data_src_id order by vname_id |
Talend job GET_TAXONOMY_UNVERIFIED
Changing legacy_id to "unverified" + vname_id so that taxon-variant and taxon-taxon lookup files can be combined for MERGE_COLLOBJ_DETERMINATIONS