Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 104 Current »

With this iteration, we are targeting the data migration mappings for the migration of SMaSCH data into a CollectionSpace instance, based on the version 2.3 release of CollectionSpace. Although there will be additional changes supported in the future, this is the major migration and launch release.

  File Modified
You are not logged in. Any changes you make will be marked as anonymous. You may want to Log In if you already have an account.
No files shared here yet.
  • Drag and drop to upload or browse for files
  • Cataloging

    Core Specimen data

    See attachment for accession mapping. The grid below is not complete.

    Field mapping

    SMaSCH

    CollectionSpace object schema

    CS User Interface Title (** = different from UI)

    Note

    accession.accession_id

    object_number

    Accession Number **

    Change field title

    institution.inst_name

    collection

    Collection

    Dropdown (Collection values)

    accession.coll_num_prefix + coll_number + coll_num_suffix

    other_number

    Number

    Moved to Field Collection Number.

    "collector number"

    other_number_type

    Number Type

    Not needed if we move this to Field Collection Number.

    accession.notes

    comments

    Comments


    objkind.kind

    form (Object Description Information)

    Form

    Dropdown (Form values)

    taxon_fullname.fullname

    title

    Taxon **

    Change field title

    accession.phenology (decoded)

    phase (Object Description Information)

    Phenology **

    Change field title. Dropdown (Phenology values)

    accession.early_jdate (calc)

    field_collection_date_earliest

    Field Collection Date Earliest

    Custom 

    accession.late_jdate (calc)

    field_collection_date_latest

    Field Collection Date Latest

    Custom

    accession.datestring

    field_collection_date

    Field Collection Date

    We are converting this to the structured date format

    committee.committee_abbr

    field_collection_collector

    Field Collection Collector


    accession.loc_lat_decimal

    field_loc_lat_decimal

    Field Location Latitude Decimal

     

    accession.loc_long_decimal

    field_loc_long_decimal

    Field Location Longitude Decimal


    accession.loc_place (calc)

    field_collection_place (?)

    Field Location Verbatim**

    Will map to verbatim field in schema extension for now.  When the Place Authority is available, we will work on creating authority records as well.

    accession.loc_county

    field_loc_county

    Field Location County

    Custom

    accession.loc_state

    field_loc_state

    Field Location State

    Custom

    accession.loc_country

    field_loc_country (temp custom)

    Field Location Country

    Custom

    accession.loc_elevation

    field_loc_elevation (temp custom)

    Field Location Country

    Custom

    accession.coll_num_prefix + coll_number + coll_num_suffix

    ?

    Field collection number


    committee.committee_abbr

    ?

    Field collection collector


    "catalog date"

    date_association

    Date Association

    Fixed text. Do not load into 1.0.  May end up in Associated Date information eventually, or create date/modify date.

    accession.catalog_date

    catalog_date

    Date Text

    Fixed text. Do not load into 1.0.  May end up in Associated Date information eventually, or create date/modify date.

    SQL query

    Using three views to get subsets of the records, identical in their selects, except for the beginning of their accession number:

    • cspace_collobj_uc_2_3b
    • cspace_collobj_jeps_2_3b
    • cspace_collobj_others_2_3b

    These views make the minimal joins needed since refnames (collectors, collection number assignors, filed_in_cabinet genus) will be looked up using text files coming from earlier Talend jobs. 

    ETL

    • Talend: GET_COLLOBJ_UC, GET_COLLOBJ_JEPS, GET_COLLOBJ_OTHERS
    • Inputs
      • cspace_collobj_uc/jeps/others views
      • ucjeps.cspace.berkeley.edu: finds accessions already in CSpace (so we don't add those a second time)
      • all org-collectors to get refnames for collectors (ucjepsorgs-all.txt -- this file created by combining two files by hand)
      • all org-collectors to get refnames for collection number assignors (ucjepsorgs-all.txt -- this file created by combining two files by hand)
      • all taxon-taxon records to get refnames for filed_in_cabinet values (smasch-taxonids.txt -- this file created by combining two files by hand)
      • locations_from_smasch.txt -- a text file provided by Dick Moe that combines several locality fields in SMaSCH into one location field for each accession.  Dick uses this to produce location data for CCH.
    • Outputs
      • objectsincspacealready-others.txt -- objects already in CSpace that will need to be updated separately
      • collobj-others-add.xml -- main XML output to be merged with other files
      • collobject-others-ids.txt -- text file with legacy_id, CSID, uri, and some refnames for bookkeeping

    Notes

    Controlled list values

    See these links

    Cataloging data sets for dropdowns and controlled lists
    1. Collection values:
      1. Jepson Herbarium
      2. University Herbarium, University of California, Berkeley (UC)
      3. University of California
      4. University of California- Santa Barbara
      5. Herbarium (UCSC)
      6. DHN
    2. Form values:
      1. Illustration
      2. Mounted on Paper
      3. Photocopy
      4. Photograph
      5. Stored in a Box or Bag
    3. Phase (Phenology) values:
      1. Cone
      2. Flowering
      3. Flowering/Fruiting
      4. Fruiting
      5. Spores/Sporangia
      6. Vegetative (non-reproductive)
    4. Loaned Object Status values:
      1. Active (Unknown)
      2. Active (All Out)
      3. Active (All In)
      4. Active (Partial)
      5. Active (Discrepancy)
      6. Cancelled (Unknown)
      7. Cancelled (All Out)
      8. Cancelled (All In)
      9. Cancelled (Partial)
      10. Cancelled (Discrepancy)
    5. Other Number Type values:
      1. Other herbaria accession number
      2. Copied from accession number
      3. Exsiccatae number
      4. Project number
      5. Genbank number
      6. Destructive sampling number
      7. Internal cross reference
      8. Miscellaneous

    MERGE: Scientific Determination History (repeating within Cataloging schema)

    Combines the name as filed and the annotation_history table from SMASCH.

    Field mapping

    See the determinations worksheet in the attached accessions mapping document

    SQL

    With some dates fixed, all determinations are now in cspace_determ_2_3.

    Variant determinations: vname_id field recalculated to be "unverified"||vname_id in order to avoid collisions with taxon_id

    ETL

    Talend job MERGE_COLLOBJ_DETERMINATIONS (0.2)

    Run job three times to modify SQL on view to get subsets: UC, JEPS, OTHER. These will map against Collection Object subsets.

    JEPS:
    where object_number like 'JEPS%'
    order by object_number, legacy_id

    UC:
    where substring(object_number, 1, patindex('%[A-Z][0-9]%', object_number)) = 'UC'
    order by object_number, legacy_id

    OTHER:
    where substring(object_number, 1, patindex('%[A-Z][0-9]%', object_number)) in ('UCSC', 'UCLA', 'UCSB', 'DHN')
    order by object_number, legacy_id

    Inputs

    • org_determiners (org-determiners vocabulary refnames)
    • all_taxon (text file with default taxonomy refnames) and taxon_unverified (text file with unverified/verified taxonomy refnames) are merged (tUnite)

    Notes

    View modified to remove time from aa.determined_date.

    MERGE: Type Kind information (repeating within Cataloging schema)

    Field mapping

    See the type kind worksheet in the attached accessions mapping document

    SQL

    Using this query with view

    SELECT
        legacy_id spectype_id,
        rtrim(object_number) accession_id,
        type_kind,
        REFERENCE,
        figure,
        determined_by_id,
        determined_by,
        notes,
        basionym_id,
        verified_basionym
    FROM
        dbo.cspace_spectype_2_3
    where object_number <> ''
    order by object_number, legacy_id desc

    create view cspace_spectype_2_3 as
    select
        ti.typeinfo_id as legacy_id,
        rtrim(ti.accession_id) as object_number,
        tk.kind as type_kind,
        case
            when ti.place_pub is not null and ti.place_pub != ''
                    and ti.year_pub is not null and ti.year_pub not in ('', '0', 'unkn')
                then ti.place_pub + ', ' + ti.year_pub
            when (ti.place_pub is null or ti.place_pub = '')
                    and ti.year_pub is not null and ti.year_pub not in ('', '0', 'unkn')
                then year_pub
            else null
        end as reference,
        ti.figure as figure,
        ti.typ_committee_id as determined_by_id,
        c.committee_abbr as determined_by,
        case
            when ti.notes is null and v.descr is not null
                then 'Type Voucher: ' + v.descr
            when ti.notes is null and v.descr is null
                then null
            when ti.notes is not null and v.descr is not null
                then ti.notes + '; Type Voucher: ' + v.descr
            when ti.notes is not null and v.descr is null
                then ti.notes
        end as notes,
        ti.basionym_id as basionym_id,
        tf.fullname as verified_basionym
    from
        types_info ti,
        typekind tk,
        committee c,
        voucher v,
        taxon_fullname tf
    where ti.typekind_id *= tk.typekind_id
    and ti.typ_committee_id *= c.committee_id
    and ti.accession_id *= v.accession_id
    and v.vouchkind_id = 56
    and ti.basionym_id *= tf.taxon_id

    ETL

    Talend job MERGE_COLLOBJ_SPECTYPE

    Notes

    MERGE: Text annotations (repeating within Cataloging schema)

    Note that in SMaSCH, a table named voucher was used to capture text notes of various kinds.  Many of these are being cleaned up and put in other places, but some will remain.

    Field mapping

    See vouchers worksheet in attached accession mapping document

    SQL

    SELECT
        object_number,
        annotation_type,
        annotation_note,
        annotation_date
    FROM
        dbo.cspace_anno_2_3
    ORDER by object_number, annotation_date
    create view cspace_anno_2_3 as
    select
        rtrim(v.accession_id) as object_number,
        vk.kind as annotation_type,
        v.descr as annotation_note,
        v.voucher_date as annotation_date
    from
        voucher v,
        vouchkind vk
    where v.vouchkind_id *= vk.vouchkind_id
    and v.vouchkind_id in (24, 29, 41, 56, 73, 78, 79)
    union
    select
        rtrim(ti.accession_id) as object_number,
        'holotype location' as annotation_type,
        ti.inst_abbr as annotation_note,
        ti.mod_date as annotation_date
    from types_info ti
    where ti.inst_abbr is not null
    and ti.inst_abbr != ''
    union
    select
        aa.accession_id as object_number,
        'additional taxa' as annotation_type,
        tf.fullname as annotation_note,
        aa.folder_name_date as annotation_date
    from
        annotation_asfiled aa,
        taxon_fullname tf
    where aa.addl_taxon_id *= tf.taxon_id
    and addl_taxon_id is not null and addl_taxon_id != 0

    ETL

    Using the SQL above directly for now.  Rtrim the accession_id.

    Talend job MERGE_COLLOBJ_ANNOTATIONS

    Notes

    MERGE: Hybrid Parents (repeating within Cataloging schema)

    From annotation_asfiled table

    Field mapping

    See annotations tab in attached accessions mapping document

    SQL

    SELECT
    annotation_asfiled.accession_id, annotation_asfiled.parent_name_id,
      (case when parent_name_modifier_kind.taxon_modifier=' ' then null else
     parent_name_modifier_kind.taxon_modifier end) as parent_modifier
    FROM
        annotation_asfiled, parent_name_modifier_kind
    where annotation_asfiled.parent_name_modifier_id = parent_name_modifier_kind.taxon_modifier_id and
    parent_name_id <> 0
    order by accession_id, parent_seqno

    ETL

    Talend job MERGE_COLLOBJ_HYBRIDPARENTS

    Notes

    MERGE: Other Numbers (repeating within Cataloging schema)

    From accession and voucher tables

    Field mapping

    See voucher worksheet in attached accession mapping document.

    SQL

    SELECT
        legacy_id,
        other_number,
        other_number_type
    FROM
        dbo.cspace_othernum_2_3
    ORDER by legacy_id
    create view cspace_othernum_2_3 as
    select
        rtrim(v.accession_id) as legacy_id,
        v.descr as other_number,
        case
            when v.vouchkind_id = 45 then
                'GenBank Code'
            when v.vouchkind_id = 55 then
                'Other Label Number'
            when v.vouchkind_id = 71 then
                'U.C. Botanical Garden Accession Number'
        end as other_number_type
    from
        voucher v
    where v.vouchkind_id in (45, 55, 71)

    ETL

    Talend job MERGE_COLLOBJ_OTHERNUM

    Notes

    45 = 'genbank code', 55 = 'other label numbers', 71 = 'U.C. Botanical Garden'

    Merge collection objects and load

    UPDATE: Collection Object - URI (done)

    Use SQL to update collectionspace_core:uri for existing collection objects.

    Update collection object records already in ucjeps.cspace (Phase II)

    There are approximately 99 collection objects in ucjeps.cspace already.  All are UC.  Check to see if any have been modified in CSpace and update as needed.

    Procedures

    Insert loan object status values into dynamic controlled list

    UCJEPS needs approximately 12 new values in a dynamic controlled list that is used by both loans in and loans out.  These can be loaded via POST commands and should be ready to go.

    Loans Out

    Field mapping

    See attachment for field mapping. This grid is outdated.

    SMaSCH

    CollectionSpace loans out schema

    CS User Interface Title (** = different from UI)

    Note

    loan_event.uc_loan_num + loan_event.jeps_loan_num

    Loan out number

     

    smasch has two types of loan identifiers.
    Loan Number relates accessions in a single collection that make up a loan.
    Loan ID relates loans from each collection that make up a loan.
    "Loan Number" = collection + loan_num
    "Loan ID" = loan_id

    loan_event.curr_inst

    Borrower

     

    Probably should keep this as borrowing institution (organization authority)

    loan_event:         
      curr_for_fname,
      curr_for_lname,
      curr_for_position

    Borrower's contact

     

    Probably should keep this as borrowing agent (person authority)

    loan_event.inhouse_notes + loan_event.noteworthy_inclusions

    Loan out note

     

     

    loan_event.loan_status +
    loan_detail.return_status

    Loaned object status

     

    It looks like this field is not currently active.  smasch keeps track of two status types:
    loan status ('Active' or 'Cancelled')
    return status ('All in', 'All out', 'Partial', 'Discrepancy', 'Unknown')

    Loan date?

     

    Loan Out Date

    Does SMaSCH have a loan date?

    SQL query

    view:

    Notes

    MERGE: Loan Out Items (repeating group within Loan Out procedure records)

    From loan_accn_ids table

    Field mapping

    SQL

    ETL

    Notes

    MERGE: Loan Taxonomy (repeating group within Loan Out procedure records)

    From several fields in loan_event table

    Field mapping

    SQL

    ETL

    Notes

    MERGE: Loan Out Returns (repeating group within Loan Out procedure records)

    From return_detail table

    Field mapping

    SQL

    ETL

    Notes

    MERGE: Loan Out Transfers (repeating group within Loan Out procedure records)

    From transfer_event table and possibly from loan_event table (orig_inst, orig_for_lname, orig_for_fname, orig_for_position)

    Field mapping

    SQL

    ETL

    Notes

    Merge loan out information and load

    Object to Loan Out Relationships (Phase II)

    These will be added as detail items.  Relationship records might not be needed.

    Relationships between UC and JEPS loans that have the same loan_id

    Relationship records need to be created between separate loan out records to handle UC and JEPS loans that have the same loan_id.

    Field mapping

    SQL

    ETL

    Notes

    Borrows (Loans In)

    Field mapping

    See attachment for field mapping.

    SQL

    Notes

    MERGE: Loan In Taxonomy (repeating group within Loan In procedure records)

    From several fields in borrow table

    Field mapping

    SQL

    ETL

    Notes

    MERGE: Loan In Returns (repeating group within Loan In procedure records)

    From borrow_returns table

    Field mapping

    SQL

    ETL

    Notes

    MERGE: Loan In Transfers (repeating group within Loan In procedure records)

    From several fields in borrow_xfer_out table

    Field mapping

    SQL

    ETL

    Notes

    Merge loans in information and load

    Handle unmatched borrow_return records

    There are 20 records in borrow_return whose borrow_id does not match with one in the main borrow table.  They look like they could match on lender_loan_num and institution. It will probably be easiest to update these by hand.

    Handle unmatched borrow_xfer_out records

    There are3 records in borrow_xfer_out whose borrow_id does not match with one in the main borrow table.  They look like they could match on lender_loan_num and institution. It will probably be easiest to update these by hand.

    Location (Location and Movement procedure)

    Use annotation_asfiled.collectionkind_id (Name of Collection, e.g., Main, Reference, Type) to indicate storage of specimen

    Field mapping

    SQL

    ETL

    Notes

    Object-Location Relationship records

    Field mapping

    SQL

    ETL

    Notes

    Vocabularies (load these first)

    Person vocabulary (default person-person)

    Field mapping

    See agents worksheet in authorities mapping file for current field mapping.

    SQL query

    View: cspace_person_2_3.

    Note: View modified to exclude the persons from loans and borrows.  Those are now handled separately.

    ETL

    Talend job GET_PERSON_ADDS adds new person records.

    Updates to be found later.

    Obsolete 5/18: Talend job diff_person3 compares SMaSCH view (cspace_persons_2_0) to ucjeps.cspace (persons_common) on legacy_id, displayname, bio_notes, birthdate, and deathdate.  May 12: Found 3 new records and 2 updates. This can also be confirmed by viewing person table in SMaSCH in table order, noting the records added or updated in reverse chronological order.

    Talend job GET_PERSON_V1 (Hoffman) was used to load data initially:

    • Input: Query view
    • See  SMaSCH data mapping notes for some special data handling
    • Output: XML file for loading (batches of 3000)
    • Output: List of CSIDs (to facilitate deleting)
    • Output: CSV file of basic identifiers (for record keeping and as input to one of the Org loading jobs).
    • 8945 rows

    Notes

    Status: SMaSCH persons records already loaded, plus persons derived from borrow and loan tables. However, these might have been updated: Person records might be updated, and borrow and loan records might be added or changed.  Person records last queried from SMaSCH on 3/8/2012 at 9:32.

    Creates URI

    5/18/2012

    New view persons_2_3 includes people from borrows, loans, transfers, etc, so new approach required.

    Excluding records where forename and surname are both null.

    Too many non-unique names picked up by the select distinct statements for the trans_person table.  Need to fix.

    Those records are static from transaction tables, so they will have to be refreshed when it's time to switchover!

    ADD records can be identifying by comparing legacy_id in SMaSCH view to person_common.shortidentifier in CSpace.

    UPDATES can be found two ways: 1) using my person_diff approach AFTER new records have been added or 2) doing a table scan on smasch.person and looking in database order.

    ucjepspersonids-all.txt is used by many other jobs.  Create it by concatenating the ID's output from original load (in CRH archive2 directory) with new adds from this batch.

    UPDATE: Person - URI (done)

    Existing person records need collectionspace_core:uri.  REST update does not work on the collectionspace_core schema, so this has to be done in SQL.

    Delete loan persons loaded already into ucjeps.cspace

    See if any are used already.  Can use short identifiers in a certain range.

    Add person records parsed from various loans and borrows tables

    In SMaSCH, there are several unstructured person fields.  These are being parsed and then loaded as unique person authority records.

    Organization-Collectors vocabulary (default organization vocabulary)

    From SMaSCH committee table where committee_func= "coll"

    Field mapping

    See attachment (agents worksheet in authority mapping file) for current field mapping

    SQL query

    Use view: cspace_org_2_0 but limit to func="coll"; link to committee table to get collection number assignor (links back into committee where func=coll) and data source and to limit the query to get collection committee records added since last load.

    ETL

    Talend job GET_COLLECTOR_ADDS, just finds adds. 

    Talend job GET_COLLECTOR_UPDATES finds 10 records to update (by hand presumably, output format is for Import Service do not run).

    Talend job GET_ORG_V2 (Hoffman) was used to load initially:

    • See SMaSCH data mapping notes for some special handling
    • Input: Query view
    • Input: ucjepspersonout-ids.txt from GET_PERSON_V1 (to get refnames of organization contacts from person authority)
    • Output: XML file for loading (batches of 4000)
    • Output: List of CSIDs (to facilitate deleting)
    • Output: CSV file of basic identifiers (for record keeping).
    • 49845 rows

    Talend job (Cheng):

    • Input: Text file from Dick Moe, non-colliding-collectors.in, containing names of seaweed collectors
    • 1546 rows
    • Already loaded into ucjeps.cspace

    Notes

    Need to update existing collector committees also with collection number assignor

    Creating URI in ADD and UPDATE

    UPDATE: Organization - URI (done)

    Existing organization records need collectionspace_core:uri.  REST update does not work on the collectionspace_core schema, so this has to be done in SQL.

    UPDATE: Organization - collection number assignors

    Update existing collectors with their collection number assignor from SMaSCH.  REST api.

    Organization-Nomenclature Committee vocabulary (organization-nomenclature vocabulary)

    From SMaSCH committee table where committee_func= "nomen".

    Field mapping

    See notes on Organization-Collectors above.

    SQL query

    ETL

    Talend job GET_ORG_NOMEN, gets all nomenclature committees and adds to new org-nomenclature vocabulary

    Notes

    Creates URI

    Need to remove existing nomenclature records from default organization vocabulary and update refnames as needed.

    Organization-Determination Committee vocabulary (organization-determination vocabulary)

    From SMaSCH committee table where committee_func= "annot".

    Field mapping

    See notes on Organization-Collectors above.

    SQL query

    select legacy_id, display_name, long_name, short_name,
    foundation_place, contact_name, org_function,
    substring(contact_name, (patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6),
    ((patindex('%)%', dbo.cspace_org_2_0.contact_name))-(patindex('%SMASCH%',
      dbo.cspace_org_2_0.contact_name)+6))) contact_id, coll_num_person_id, data_src_name
    from dbo.cspace_org_2_0, dbo.committee, data_source
    where legacy_id=convert(varchar,
      committee_id) and dbo.committee.data_src_id=data_source.data_src_id and
     committee_func = 'annot

    ETL

    Talend job GET_ORG_DETERMINATION gets all determination committees for adding to new org-determination vocabulary

    Notes

    Creating URI

    Need to remove existing determination records from default organization vocabulary and update refnames as needed.

    Organization-Type Assertion Committee vocabulary (organization-typeassertion vocabulary)

    From SMaSCH committee table where committee_func= "typ".

    Field mapping

    See notes on Organization-Collectors above.

    SQL query

    select legacy_id, display_name, long_name, short_name,
    foundation_place, contact_name, org_function,
    substring(contact_name, (patindex('%SMASCH%', dbo.cspace_org_2_0.contact_name)+6),
    ((patindex('%)%', dbo.cspace_org_2_0.contact_name))-(patindex('%SMASCH%',
      dbo.cspace_org_2_0.contact_name)+6))) contact_id, coll_num_person_id, data_src_name
    from dbo.cspace_org_2_0, dbo.committee, data_source
    where legacy_id=convert(varchar,
      committee_id) and dbo.committee.data_src_id=data_source.data_src_id and
     committee_func = 'typ

    ETL

    Talend job GET_ORG_TYPEASSERTION gets all committees of function "typ" for loading into new org-typeassertion vocabulary

    Notes

    Creates URI

    Need to remove existing type assertion records from default organization vocabulary and update refnames as needed.

    Clean up organization-organization (collectors) vocabulary in CSpace

    Remove nomen, determ, type, and inst values loaded previously.

    Organization-Institution vocabulary (organization-institute vocabulary)

    Field mapping

    See the agents worksheet in the authorities mapping attachment

    Use the institute table

    SQL query

    Access the institute table directly, including the contact schema fields.

    Might need to transform the short identifiers and refnames for institutes whose abbreviation starts with an asterisk. No, they seem to act OK.

    SELECT dbo.institute.inst_abbr,
    		dbo.institute.inst_name,
    		dbo.institute.inst_address1,
    		dbo.institute.inst_address2,
    		dbo.institute.inst_city,
    		dbo.institute.inst_state,
    		dbo.institute.inst_zipcode,
    		dbo.institute.inst_country,
    		dbo.institute.location,
    		dbo.institute.exch_bal_total,
    		dbo.institute.exch_bal_vasc,
    		dbo.institute.exch_bal_myc,
    		dbo.institute.exch_bal_phyc,
    		dbo.institute.exch_bal_bry,
    		dbo.institute.exch_bal_types,
    		dbo.institute.exch_bal_correction,
    		dbo.institute.areas_they_want,
    		dbo.institute.taxa_they_want,
    		dbo.institute.areas_they_have,
    		dbo.institute.taxa_they_have,
    		dbo.institute.swap_for_lit,
    		dbo.institute.avg_desid,
    		dbo.institute.avg_specs,
    		dbo.institute.phone,
    		dbo.institute.FAX,
    		dbo.institute.URL,
    		dbo.institute.email
    FROM	dbo.institute

    ETL

    Talend job GET_ORG_INSTITUTION creates XML outputs for org-institution vocabulary and contacts schema.

    Notes

    Creating URI in Institution and Contact

    Need to remove existing institution records from default organization vocabulary and update refnames as needed.

    Add institutions parsed from loans and borrows tables

    There are approximately 12 institutions in various loans and borrows tables whose abbreviation does not match any in the institute table.

    Common Names vocabulary (taxon-common vocabulary)

    Note: Load before taxon-taxon vocabulary

    Use the common_name table.

    Create collectionspace_core:URI

    Field mapping

    See the taxonomy worksheet in the attached authorities mapping attachment

    SQL

    SELECT
        common_name_id,
        name common_name,
        dbo.common_name.data_src_id,
        dbo.data_source.data_src_name
    FROM
        dbo.common_name, dbo.data_source
    where dbo.common_name.data_src_id = dbo.data_source.data_src_id
    order by common_name_id

    ETL

    Talend job GET_TAXONOMY_COMMON

    Notes

    Controlled list values for the Name Source field on taxonomy vocabularies (taxon-taxon, taxon-common, and taxon-unverified) are documented at (they are identical to the display value):

    Taxonomy vocabulary (default taxon-taxon vocabulary)

    NOTE: Need to load the taxon-common vocabulary first!

    The taxon_name table is the primary source for this information. Some information coming from taxon_fullname and taxon_noauth_name. Note the repeating information for Taxonomy Authors and Common Names described below.

    Note: Check to see if inAuth value is identical on ucjeps2 and ucjeps.cspace.  They are: 87036424-e55f-4e39-bd12

    Field mapping

    See the taxonomy worksheet in the authorities mapping attachment

    SQL

    Using the view cspace_taxonomy_2_3.

    ETL

    Talend job GET_TAXONOMY_V1 (0.2)

    Input: basionym_lookup.txt (replace with all_taxon?)

    Notes

    Basionym values handled (skipped if 0 or equal to taxon_id).

    Controlled list values for the Name Source field on taxonomy vocabularies (taxon-taxon, taxon-common, and taxon-unverified) are documented in

    UPDATE: Taxon - URI

    Existing taxonomy records need collectionspace_core:uri.  REST update does not work on the collectionspace_core schema, so this has to be done in SQL.

    9292 records have a valid basionym

    MERGE: Common Names for Taxonomic Terms (repeating within Taxonomy schema for taxon-taxon vocabulary)

    Field mapping

    See the taxonomy worksheet in the authorities mapping attachment

    Use the smasch_common_names table.

    Get Common Name refnames from the taxon-common output

    SQL

    SELECT
        taxon_id,
        common_name_id
    FROM
        dbo.smasch_common_names
    order by taxon_id

    ETL

    Talend job MERGE_TAXON_COMMON

    Notes

    MERGE: Authors for Taxonomic Terms (repeating within Taxonomy schema for taxon-taxon vocabulary)

    Field mapping

    See the taxonomy worksheet in the authorities mapping attachment.

    From taxon_name table (or a new view)

    Get author refnames from the org-nomenclature output

    SQL

    SELECT
        taxon_id,
        author_id,
        'author' author_type
    FROM
        dbo.taxon_name
    where author_id > 0
    union
    SELECT
        taxon_id,
        ascr_auth_id,
        'ascribed author' author_type
    FROM
        dbo.taxon_name
    where ascr_auth_id > 0
    union
    SELECT
        taxon_id,
        p_author_id,
        'parenthetical author' author_type
    FROM
        dbo.taxon_name
    where p_author_id > 0
    union
    SELECT
        taxon_id,
        p_ascr_auth_id,
        'parenthetical ascribed author' author_type
    FROM
        dbo.taxon_name
    where p_ascr_auth_id > 0
    order by taxon_id, author_type

    ETL

    Talend job MERGE_TAXON_AUTHORS

    Notes

    Taxonomy Hierarchical Relations

    Field mapping

    See taxonomy worksheet in authorities mapping file.

    Use taxon_name fields (taxon_id and parent_id) to create hierarchical relationship records.  Get their refnames from the taxon-taxon output.

    SQL

    Notes

    Merge default taxonomy information and load

    Unverified Taxonomic Names vocabulary (taxon-unverified vocabulary)

    Field mapping

    See the taxonomy worksheet in the authorities mapping attachment

    Use variant_taxon_name table.

    SQL

    SELECT
        vname_id,
        variant_name,
        dbo.variant_taxon_name.data_src_id,
        dbo.data_source.data_src_name
    FROM
        dbo.variant_taxon_name, dbo.data_source
    where dbo.variant_taxon_name.data_src_id = dbo.data_source.data_src_id
    order by vname_id

    ETL

    Talend job GET_TAXONOMY_UNVERIFIED

    Notes

    Changing legacy_id to "unverified" + vname_id so that taxon-variant and taxon-taxon lookup files can be combined for MERGE_COLLOBJ_DETERMINATIONS

    • No labels