Skip to end of metadata
Go to start of metadata

You are viewing an old version of this page. View the current version.

Compare with Current View Page History

« Previous Version 202 Current »

With this iteration, we are targeting the data migration mappings for the migration of SAGE data into a CollectionSpace instance, based on the version 2.4 release of CollectionSpace. This launch is intended to support the full migration of the UC Botanical Garden SAGE system into a production CollectionSpace system.  Changes from the previous mapping are in blue.

  File Modified
You are not logged in. Any changes you make will be marked as anonymous. You may want to Log In if you already have an account.
No files shared here yet.
  • Drag and drop to upload or browse for files
  • Authorities

    Person (default person vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

    Tentatively, we will have one person vocabulary, mapping the agents from SAGE that are also in the persons table.

    There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.

    Field mapping

    SQL

    select
        p.person_id as short_identifier,
        an.name as display_name,
        p.birth_date as birth_date,
        p.death_date as death_date,
        datepart(year, p.birth_date) as birth_date_year,
        datepart(month, p.birth_date) as birth_date_month,
        datepart(day, p.birth_date) as birth_date_day,
        case when p.birth_date is null then null
            else str_replace(convert(char(10), p.birth_date, 102), '.', '-')
                + 'T00:00:00Z'
        end as birth_date_Esclr,
        case when p.birth_date is null then null
            else str_replace(convert(char(10), p.birth_date, 102), '.', '-')
                + 'T23:59:59Z'
        end as birth_date_Lsclr,
        datepart(year, p.death_date) as death_date_year,
        datepart(month, p.death_date) as death_date_month,
        datepart(day, p.death_date) as death_date_day,
        case when p.death_date is null then null
            else str_replace(convert(char(10), p.death_date, 102), '.', '-')
                + 'T00:00:00Z'
        end as death_date_Esclr,
        case when p.death_date is null then null
            else str_replace(convert(char(10), p.death_date, 102), '.', '-')
                + 'T23:59:59Z'
        end as death_date_Lsclr,
        p.person_remark as name_note
    from
        person p,
        agent_name an
    where p.person_id = an.agent_id
    and an.name_type = 'pref'
    and p.person_id > 0

    Notes

    •  

    Talend/ETL

    bgimportperson1 0.2: Revising earlier talend job (based on 1.8) to change refname format, remove ampersand handling, and include collectionspace_core:uri element

    New Talend job by Yuteh: BG_person 0.1

    MERGE: Person Terms (repeating group) - Lam (SQL ready), Talend job (yuteh)

    E.g., Different name forms (preferred, alias)

    Field mapping

    SQL

    -- kewabbr as preferred names when they are duplicates of preferred name
    select
        an.agent_id as short_identifier,
        an.name as display_name,
        an.name as term_name,
        an.name_type as term_type,
        'accepted' as term_status,
        'English' as language,
        an_source.name as source,
        p.data_source_id as source_id,
        p.prefix as title,
        p.first_name as forename,
        p.middle_name as middle_name,
        p.last_name as surname,
        p.suffix as name_addition,
        case
            when p.first_name is null and p.middle_name is null
                then substring(p.last_name, 1, 1)
            when p.first_name is not null and p.middle_name is null
                then substring(p.first_name, 1, 1) + substring(p.last_name, 1, 1)
            when p.first_name is null and p.middle_name is not null
                then substring(p.middle_name, 1, 1) + substring(p.last_name, 1, 1)
            else substring(p.first_name, 1, 1) +
                substring(p.middle_name, 1, 1) +
                substring(p.last_name, 1, 1)
        end as initials,
        1 as preferred_fg
    from
        agent_name an,
        agent_name an2,
        person p,
        agent_name an_source,
        agent_name_role anr
    where p.person_id = an.agent_id
    and an.agent_id = an2.agent_id
    and an2.name_type = 'pref'
    and an.name_type = 'kewabbr'
    and an.name = an2.name
    and p.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type = 'pref'
    and anr.role_type = 'datasrc'
    and p.person_id > 0
    union
    -- preferred names except those that are duplicates with a kewabbr
    select
        an.agent_id as short_identifier,
        an.name as display_name,
        an.name as term_name,
        an.name_type as term_type,
        'accepted' as term_status,
        'English' as language,
        an_source.name as source,
        p.data_source_id as source_id,
        p.prefix as title,
        p.first_name as forename,
        p.middle_name as middle_name,
        p.last_name as surname,
        p.suffix as name_addition,
        case
            when p.first_name is null and p.middle_name is null
                then substring(p.last_name, 1, 1)
            when p.first_name is not null and p.middle_name is null
                then substring(p.first_name, 1, 1) + substring(p.last_name, 1, 1)
            when p.first_name is null and p.middle_name is not null
                then substring(p.middle_name, 1, 1) + substring(p.last_name, 1, 1)
            else substring(p.first_name, 1, 1) +
                substring(p.middle_name, 1, 1) +
                substring(p.last_name, 1, 1)
        end as initials,
        1 as preferred_fg
    from
        agent_name an,
        person p,
        agent_name an_source,
        agent_name_role anr
    where p.person_id = an.agent_id
    and an.name_type = 'pref'
    and p.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type = 'pref'
    and anr.role_type = 'datasrc'
    and p.person_id > 0
    and an.agent_name_id not in (
        select an_pref.agent_name_id
        from person p, agent_name an_pref, agent_name an_kew
        where p.person_id = an_pref.agent_id
        and an_pref.agent_id = an_kew.agent_id
        and an_pref.name_type = 'pref'
        and an_kew.name_type = 'kewabbr'
        and an_pref.name = an_kew.name)
    union
    -- alternate names that are not duplicates of pref name
    select
        an.agent_id as short_identifier,
        an.name as display_name,
        an.name as term_name,
        an.name_type as term_type,
        'accepted' as term_status,
        'English' as language,
        an_source.name as source,
        p.data_source_id as source_id,
        p.prefix as title,
        p.first_name as forename,
        p.middle_name as middle_name,
        p.last_name as surname,
        p.suffix as name_addition,
        case
            when p.first_name is null and p.middle_name is null
                then substring(p.last_name, 1, 1)
            when p.first_name is not null and p.middle_name is null
                then substring(p.first_name, 1, 1) + substring(p.last_name, 1, 1)
            when p.first_name is null and p.middle_name is not null
                then substring(p.middle_name, 1, 1) + substring(p.last_name, 1, 1)
            else substring(p.first_name, 1, 1) +
                substring(p.middle_name, 1, 1) +
                substring(p.last_name, 1, 1)
        end as initials,
        0 as preferred_fg
    from
        agent_name an,
        agent_name an2,
        person p,
        agent_name an_source,
        agent_name_role anr
    where p.person_id = an.agent_id
    and an.agent_id = an2.agent_id
    and an2.name_type = 'pref'
    and an.name_type != 'pref'
    and an.name != an2.name
    and p.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type = 'pref'
    and anr.role_type = 'datasrc'
    and p.person_id > 0
    order by short_identifier, preferred_fg desc;

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_personTerm 0.1

    Organization (default Org vocabulary will be UCBG institutions) - Lam (SQL ready), Talend dev (yuteh)

    We will have the following three organization vocabularies:

    • organizations (institutions from SAGE organizations plus the voucher institutions)
    • collectors (coming from the separate SAGE table)
    • groups (mostly KEW names for nomenclature)

    Field mapping

    SQL

    select
       convert(varchar(5), org_id) as short_identifier,
       org_name as display_name,
       org_remark as notes
    from organization
    union
    select distinct
        voucher_number as short_identifier,
        case when voucher_name is null then voucher_number||' (voucher institution)'
          else voucher_name end as display_name,
        null as notes
    from voucher_number_code
    

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_org 0.1

    MERGE: Organization Terms (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

    Field mapping

    SQL

    select
        convert(varchar(5), o.org_id) as short_identifier,
        an.name as display_name,
        an.name_type as term_type,
        'accepted' as term_status,
        an.name as term_name,
        'English' as term_language,
        an_source.name as term_source,
        an_source.agent_id as term_source_id,
        an.name as main_body_name,
        o.dept_unit as additions_to_name,
        1 as preferred_fg
    from
        organization o,
        agent_name an,
        agent_name an_source,
        agent_name_role anr
    where o.org_id = an.agent_id
    and an.name_type = 'pref'
    and o.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type = 'pref'
    and anr.role_type = 'datasrc'
    union
    select
        convert(varchar(5), o.org_id) as short_identifier,
        an.name as display_name,
        an.name_type as term_type,
        'accepted' as term_status,
        an.name as term_name,
        'English' as term_language,
        an_source.name as term_source,
        an_source.agent_id as term_source_id,
        an.name as main_body_name,
        o.dept_unit as additions_to_name,
        0 as preferred_fg
    from
        organization o,
        agent_name an,
        agent_name an2,
        agent_name an_source,
        agent_name_role anr
    where o.org_id = an.agent_id
    and an.agent_id = an2.agent_id
    and an.name_type != 'pref'
    and an2.name_type = 'pref'
    and an.name != an2.name
    and o.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type = 'pref'
    and anr.role_type = 'datasrc'
    union
    select
        voucher_number as short_identifier,
        case when voucher_name is null then voucher_number||' (voucher institution)'
          else voucher_name end as display_name,
        'pref' as term_type,
        'accepted' as term_status,
        case when voucher_name is null then voucher_number||' (voucher institution)'
          else voucher_name end as term_name,
        'English' as term_language,
        null as term_source,
        null as term_source_id,
        case when voucher_name is null then voucher_number||' (voucher institution)'
          else voucher_name end as main_body_name,
        null as additions_to_name,
        1 as preferred_fg
    from voucher_number_code
    order by short_identifier, preferred_fg desc
    

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_orgTerm 0.1

    MERGE: Organization Associated Persons (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

    Field mapping

    SQL

    -- need to verify person_id = 0 records.
    select
        op.org_id as legacy_org_id,
        op.person_id as legacy_affperson_id,
        an.name as affiliated_person,
        'urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name('
        + convert(varchar(10), op.person_id) + ')'''+ an.name + ''''
        as affiliated_person_refname,
        'member' as affiliated_person_type
    from
        org_person op,
        agent_name an
    where op.person_id = an.agent_id
    and an.name_type = 'pref'

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_orgAssocPsn 0.1

    Collectors (Organization vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

    Note: Only one name in SAGE so no repeating groups

    Field mapping

    SQL

    select
        cc.collector_id as short_identifier,
        cc.collector_name as display_name,
        cc.collector_name as term_name,
        'English' as term_language,
        'pref' as term_type,
        'accepted' as term_status,
        cc.collector_name as main_body_name,
        1 as preferred_fg
    from collector_code cc
    where collector_id != 1
    order by cc.collector_name

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_collector 0.1

    Groups (Organization Vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

    Field mapping

    SQL

    select
        group_id as short_identifier,
        group_name as display_name,
        group_remark as notes
    from
        group_master
    order by group_id

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_group 0.1

    MERGE: Groups Terms (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

    Field mapping

    SQL

    -- kewabbr group names that are duplicates of pref group names
    select
        gm.group_id as short_identifier,
        an.name as display_name,
        an.name_type as term_type,
        'accepted' as term_status,
        an.name as term_name,
        'English' as term_language,
        an.name as term_source,
        an.agent_id as term_source_id,
        an.name as main_body_name,
        1 as preferred_fg
    from
        group_master gm,
        agent_name an,
        agent_name an2,
        agent_name an_source,
        agent_name_role anr
    where gm.group_id = an.agent_id
    and an.agent_id = an2.agent_id
    and an.name_type = 'kewabbr'
    and an2.name_type = 'pref'
    and an.name = an2.name
    and gm.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type= 'pref'
    and anr.role_type = 'datasrc'
    union
    -- pref group names except for pref names that are duplicates of kewabbr
    select
        gm.group_id as short_identifier,
        an.name as display_name,
        an.name_type as term_type,
        'accepted' as term_status,
        an.name as term_name,
        'English' as term_language,
        an.name as term_source,
        an.agent_id as term_source_id,
        an.name as main_body_name,
        1 as preferred_fg
    from
        group_master gm,
        agent_name an,
        agent_name an_source,
        agent_name_role anr
    where gm.group_id = an.agent_id
    and gm.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type= 'pref'
    and anr.role_type = 'datasrc'
    and an.name_type = 'pref'
    and an.agent_name_id not in (
        select an_pref.agent_name_id
        from group_master gm, agent_name an_pref, agent_name an_kew
        where gm.group_id = an_pref.agent_id
        and an_pref.agent_id = an_kew.agent_id
        and an_pref.name_type = 'pref'
        and an_kew.name_type = 'kewabbr'
        and an_pref.name = an_kew.name)
    union
    -- non-pref group names that are not duplicates of pref name
    select
        gm.group_id as short_identifier,
        an.name as display_name,
        an.name_type as term_type,
        'accepted' as term_status,
        an.name as term_name,
        'English' as term_language,
        an.name as term_source,
        an.agent_id as term_source_id,
        an.name as main_body_name,
        0 as preferred_fg
    from
        group_master gm,
        agent_name an,
        agent_name an2,
        agent_name an_source,
        agent_name_role anr
    where gm.group_id = an.agent_id
    and an.agent_id = an2.agent_id
    and an.name_type != 'pref'
    and an2.name_type = 'pref'
    and an.name != an2.name
    and gm.data_source_id = an_source.agent_id
    and an_source.agent_name_id = anr.agent_name_id
    and an_source.name_type = anr.name_type
    and an_source.name_type = 'pref'
    and anr.role_type = 'datasrc'
    order by short_identifier, preferred_fg desc;

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_groupTerm 0.1

    MERGE: Groups Associated Persons (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

    Field mapping

    SQL

    --need to verify person_id = 0 records
    select
        gp.group_id as legacy_org_id,
        gp.person_id as legacy_affperson_id,
        an.name as affiliated_person,
        'urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name('
            + convert(varchar(10), gp.person_id) + ')''' +  an.name + ''''
        as affiliated_person_refname,
        'member' as affiliated_person_type
    from
        group_person gp,
        agent_name an
    where gp.person_id = an.agent_id
    and an.name_type = 'pref'

    Notes

    Talend/ETL

    New Talend job by Yuteh: BG_org_groupAssocPsn 0.1

    Taxonomy (default taxonomy vocabulary for verified, and possibly unverified, names) - Lam (SQL ready), Yuteh (Talend)

    Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

    Field mapping

    SQL

    -- records that do not have scope notes and are in taxon_name table
    select
        convert(varchar(10), l.lexicon_id) as short_identifier,
        l.term as display_name,
        case
            when l.node_type in (
                'cl', 'div', 'f', 'fam', 'gen', 'or', 'sect', 'sp',
                'subcl', 'subgen', 'subsect', 'subsp', 'var')
            then l.node_type
            else null
        end as taxon_rank,
        fam.term as family,
        case when fam.term is null or fam.term = ''
            then null
            else
                'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
                + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
        end as family_refname,
        tn.notho_taxon as hybrid_fg,
        case
            when tn.access_code is not null then
                case
                    when tn.access_code = 0 then 'Unrestricted'
                    when tn.access_code = 1 then 'Restricted'
                    when tn.access_code = 2 then 'Dead'
                end
            else
                case
                    when l.access_code = 0 then 'Unrestricted'
                    when l.access_code = 1 then 'Restricted'
                    when l.access_code = 2 then 'Dead'
                end
        end as access_restrictions,
        null as taxon_note                                                          
    from
        lexicon l,
        taxon_name tn,
        (select tc.descendant_id, tc.ancestor_id, l.term
            from trans_closure tc, lexicon l
            where tc.ancestor_id = l.lexicon_id
            and l.node_type = 'fam') fam
    where l.lexicon_id *= tn.taxon_id
    and l.lexicon_id *= fam.descendant_id
    and l.node_type not in ('common', 'bioname', 'na', 'root', 'cv')
    and l.lexicon_id not in (select lexicon_id from scope_notes)
    and l.lexicon_id not in (
        select lexicon_id from lexicon_term_relation
        where relation_type in ('TRV', 'TNA'))
    and lexicon_id not in (
    40017020, 40017429, 40031851, 40031856, 40031857, 40032337, 40012018, 40016672, 40010160, 40015069,
     40015070, 40029985, 40014456, 40014457, 40014462, 40014463, 40030806, 40014465, 40014466, 202870,
    40025519, 100064, 40012457, 40031859, 40013623, 20100090, 40032453, 40032457, 40034294, 40014967,
    40014471, 40014472, 20201711, 20201785, 40014468, 40014469, 20100116, 40016773, 40005868, 40033585,
     40033586, 40025813, 315305, 40014791, 40014792, 202729, 40032253, 40032254, 20100160, 40030847,
    40032250, 40032251, 40024254, 40030854, 40030855, 40033591, 40033592, 20202247, 40014474, 40014475,
    202261, 40021891, 40014477, 40014478, 40012355, 40017959, 40005017, 40014459, 40014460, 40030850,
     40014480, 40014481, 200287, 40022148, 40022149, 40014995, 40011239, 40011242, 40031768, 40034171,
    40011246, 40019386, 40028973, 40029003, 402302, 40020726, 40030845, 40030846, 40033588, 40033589,
    314246, 20201549, 40011993, 20401262, 20401279, 20401277, 20401278, 20401291, 20401273, 40021845,
     40020725, 40014483, 40014484, 40021609, 40005593, 309267, 201171, 40010997, 40020903)
    union
    -- records that have scope notes
    select
        convert(varchar(10), l.lexicon_id) as short_identifier,
        l.term as display_name,
        case
            when l.node_type in (
                'cl', 'div', 'f', 'fam', 'gen', 'or', 'sect', 'sp',
                'subcl', 'subgen', 'subsect', 'subsp', 'var')
            then l.node_type
            else null
        end as taxon_rank,
        fam.term as family,
        case when fam.term is null or fam.term = ''
            then null
            else
                'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
                + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
        end as family_refname,
        tn.notho_taxon as hybrid_fg,
        case
            when tn.access_code is not null then
                case
                    when tn.access_code = 0 then 'Unrestricted'
                    when tn.access_code = 1 then 'Restricted'
                    when tn.access_code = 2 then 'Dead'
                end
            else
                case
                    when l.access_code = 0 then 'Unrestricted'
                    when l.access_code = 1 then 'Restricted'
                    when l.access_code = 2 then 'Dead'
                end
        end as access_restrictions,
        convert(varchar(255), sn.scope_notes) as taxon_note
    from
        lexicon l,
        taxon_name tn,
        (select tc.descendant_id, tc.ancestor_id, l.term
            from trans_closure tc, lexicon l
            where tc.ancestor_id = l.lexicon_id
            and l.node_type = 'fam') fam,
        scope_notes sn
    where l.lexicon_id *= tn.taxon_id
    and l.lexicon_id *= fam.descendant_id
    and l.node_type not in ('common', 'bioname', 'na', 'root', 'cv')
    and l.lexicon_id = sn.lexicon_id
    and l.lexicon_id not in (
        select lexicon_id from lexicon_term_relation
        where relation_type in ('TRV', 'TNA'))
    union
    -- cultivar records without scope notes
    select
        convert(varchar(10), c.lexicon_id) as short_identifier,
        c.new_term as display_name,
        'cv' as taxon_rank,
        fam.term as family,
        case when fam.term is null or fam.term = ''
            then null
            else
                'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
                + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
        end as family_refname,
        tn.notho_taxon as hybrid_fg,
        case
            when tn.access_code is not null then
                case
                    when tn.access_code = 0 then 'Unrestricted'
                    when tn.access_code = 1 then 'Restricted'
                    when tn.access_code = 2 then 'Dead'
                end
            else
                case
                    when c.access_code = 0 then 'Unrestricted'
                    when c.access_code = 1 then 'Restricted'
                    when c.access_code = 2 then 'Dead'
                end
        end as access_restrictions,
        null as taxon_note
    from
        cspace_lexicon_cv c,
        taxon_name tn,
        (select tc.descendant_id, tc.ancestor_id, l.term
            from trans_closure tc, lexicon l
            where tc.ancestor_id = l.lexicon_id
            and l.node_type = 'fam') fam
    where c.lexicon_id *= tn.taxon_id
    and c.lexicon_id *= fam.descendant_id
    and c.lexicon_id not in (select lexicon_id from scope_notes)
    and c.lexicon_id not in (
        select lexicon_id from lexicon_term_relation
        where relation_type in ('TRV', 'TNA'))
    union
    -- cultivar records that have scope notes
    select
        convert(varchar(10), c.lexicon_id) as short_identifier,
        c.new_term as display_name,
        'cv' as taxon_rank,
        fam.term as family,
        case when fam.term is null or fam.term = ''
            then null
            else
                'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
                + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
        end as family_refname,
        tn.notho_taxon as hybrid_fg,
        case
            when tn.access_code is not null then
                case
                    when tn.access_code = 0 then 'Unrestricted'
                    when tn.access_code = 1 then 'Restricted'
                    when tn.access_code = 2 then 'Dead'
                end
            else
                case
                    when c.access_code = 0 then 'Unrestricted'
                    when c.access_code = 1 then 'Restricted'
                    when c.access_code = 2 then 'Dead'
                end
        end as access_restrictions,
        convert(varchar(255), sn.scope_notes) as taxon_note
    from
        cspace_lexicon_cv c,
        taxon_name tn,
        (select tc.descendant_id, tc.ancestor_id, l.term
            from trans_closure tc, lexicon l
            where tc.ancestor_id = l.lexicon_id
            and l.node_type = 'fam') fam,
        scope_notes sn
    where c.lexicon_id *= tn.taxon_id
    and c.lexicon_id *= fam.descendant_id
    and c.lexicon_id = sn.lexicon_id
    and c.lexicon_id not in (
        select lexicon_id from lexicon_term_relation
        where relation_type in ('TRV', 'TNA'))
    order by short_identifier;
    

    Notes

    • Not getting author shortIDs in view, just names. For now putting them in Taxon Notes. Adjust view or get them separately in SQL.
    • Do we need to adjust the format of scientific names?
    • Lots of other Sage data for Names to add
    • SQL query above is a union of names without scope notes and names with scope notes.  the query that added an outer join with the scope notes table took over 50 minutes to run, so it was easier to split it out into a union of two queries.
    • SQL query above excludes the following node_types: bioname, common, na, root.  It also excludes TRE, TRV, and TNA terms. TNA terms will be added as an alternate name of type 'no author name'.  TRE names are all bionames, so the bioname exclusion will also exclude TRE names.
    • This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

    Talend/ETL

    bgtaxonimport2 0.2

    MERGE: Taxon Terms (repeating group) - Lam (SQL ready), Yuteh (Talend)

    Field mapping

    SQL

    Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

    -- verified names
    select
        convert(varchar(10), l.lexicon_id) as short_identifier,
        l.term as display_name,
        l.term as term_name,
        'Verified Taxonomic Name' as term_type,
        an.name as term_source,                                                     
        an.agent_id as term_source_uid,
        'accepted' as term_status,
        case
            when tn.catalog_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        1 as preferred_flag
    from
        lexicon l,
        taxon_name tn,
        agent_name an,
        agent_name_role anr
    where l.lexicon_id = tn.taxon_id
    and tn.data_src_id = an.agent_id
    and an.agent_id = anr.agent_id
    and an.agent_name_id = anr.agent_name_id
    and an.name_type = anr.name_type
    and an.name_type = 'pref'
    and anr.role_type = 'datasrc'
    and l.node_type not in ('common', 'bioname', 'na', 'root', 'cv')
    union
    -- no author names
    select
        convert(varchar(10), rl.lexicon_id) as short_identifier,
        l.term as display_name,
        l.term as term_name,
        'Taxon No Author Name' as term_type,
        null as term_source,
        null as term_source_uid,
        'accepted' as term_status,
        case
            when l.valid_catalog_term_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        0 as preferred_flag
    from
        lexicon l,
        lexicon_term_relation ltr,
        lexicon rl
    where l.lexicon_id = ltr.lexicon_id
    and ltr.relation_type = 'TNA'
    and ltr.related_lexicon_id = rl.lexicon_id
    and l.node_type != 'cv'
    and l.term != rl.term
    union
    -- remaining 821 terms that are not verified terms and are not no author names:
    select
        convert(varchar(10), l.lexicon_id) as short_identifier,
        l.term as display_name,
        l.term as term_name,
        'Unverified Name' as term_type,
        null as term_source,
        null as term_source_uid,
        'accepted' as term_status,
        case
            when l.valid_catalog_term_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        0 as preferred_flag
    from
        lexicon l
    where l.lexicon_id not in (
        select taxon_id from taxon_name)
    and l.node_type not in ('bioname', 'common', 'root', 'na', 'cv')
    and l.lexicon_id not in (
        select lexicon_id from lexicon_term_relation where relation_type = 'TNA')
    and l.lexicon_id not in (
        select lexicon_id from lexicon_term_relation where relation_type = 'TRV')
    union
    -- cv verified names
    select
        convert(varchar(10), c.lexicon_id) as short_identifier,
        c.new_term as display_name,
        c.new_term as term_name,
        'Verified Taxonomic Name' as term_type,
        an.name as term_source,
        an.agent_id as term_source_uid,
        'accepted' as term_status,
        case
            when tn.catalog_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        1 as preferred_flag
    from
        cspace_lexicon_cv c,
        taxon_name tn,
        agent_name an,
        agent_name_role anr
    where c.lexicon_id = tn.taxon_id
    and tn.data_src_id = an.agent_id
    and an.agent_id = anr.agent_id
    and an.agent_name_id = anr.agent_name_id
    and an.name_type = anr.name_type
    and an.name_type = 'pref'
    and anr.role_type = 'datasrc'
    union
    -- cv no author names
    select
        convert(varchar(10), rl.lexicon_id) as short_identifier,
        c.new_term as display_name,
        c.new_term as term_name,
        'Taxon No Author Name' as term_type,
        null as term_source,
        null as term_source_uid,
        'accepted' as term_status,
        case
            when c.valid_catalog_term_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        0 as preferred_flag
    from
        cspace_lexicon_cv c,
        lexicon_term_relation ltr,
        cspace_lexicon_cv rl
    where c.lexicon_id = ltr.lexicon_id
    and ltr.relation_type = 'TNA'
    and ltr.related_lexicon_id = rl.lexicon_id
    and c.new_term != rl.new_term
    union
    -- remaining 821 terms that are not verified terms and are not no author names:
    select
        convert(varchar(10), c.lexicon_id) as short_identifier,
        c.new_term as display_name,
        c.new_term as term_name,
        'Unverified Name' as term_type,
        null as term_source,
        null as term_source_uid,
        'accepted' as term_status,
        case
            when c.valid_catalog_term_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        0 as preferred_flag
    from
        cspace_lexicon_cv c
    where c.lexicon_id not in (
        select taxon_id from taxon_name)
    and c.lexicon_id not in (
        select lexicon_id from lexicon_term_relation where relation_type = 'TNA')
    and c.lexicon_id not in (
        select lexicon_id from lexicon_term_relation where relation_type = 'TRV')
    union
    -- grex names
    select
        convert(varchar(10), l.lexicon_id) as short_identifier,
        rtrim(str_replace(substring(l.term, 1, patindex('% ''%''', l.term)), ' cv. ', ' ')) as display_name,
        rtrim(str_replace(substring(l.term, 1, patindex('% ''%''', l.term)), ' cv. ', ' ')) as term_name,
        'Grex Name' as term_type,
        an.name as term_source,
        an.agent_id as term_source_uid,
        'accepted' as term_status,
        case
            when tn.catalog_flag = 1 then 'valid'
            else 'invalid'
        end as taxonomic_status,
        0 as preferred_flag
    from
        lexicon l,
        taxon_name tn,
        agent_name an,
        agent_name_role anr
    where l.lexicon_id = tn.taxon_id
    and tn.data_src_id = an.agent_id
    and an.agent_id = anr.agent_id
    and an.agent_name_id = anr.agent_name_id
    and an.name_type = anr.name_type
    and an.name_type = 'pref'
    and anr.role_type = 'datasrc'
    and l.node_type = 'cv'
    and l.term like '%cv. _%''%'''
    and l.term not like 'cv. %'
    and l.lexicon_id != 40025687
    order by preferred_flag desc, short_identifier;
    
     

    Notes

    • Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

    Talend/ETL

    MERGE: Taxon Authors (repeating group) - Lam (SQL ready)

    Field mapping

    SQL

    select
        tn.taxon_id as legacy_id,
        an.agent_id as agent_id,
        tn.author_id as author_name_id,
        an.name as author_name,
        'author' as author_type,
        1 as author_order
    from
        taxon_name tn,
        agent_name an
    where tn.author_id = an.agent_name_id
    and an.name_type = 'kewabbr' and tn.author_id <> 5
    union
    select
        tn.taxon_id as legacy_id,
        an.agent_id as agent_id,
        tn.ascr_auth_id as author_name_id,
        an.name as author_name,
        'ascribed author' as author_type,
        2 as author_order
    from
        taxon_name tn,
        agent_name an
    where tn.ascr_auth_id = an.agent_name_id
    and an.name_type = 'kewabbr' and tn.ascr_auth_id <> 5
    union
    select
        tn.taxon_id as legacy_id,
        an.agent_id as agent_id,
        tn.p_author_id as author_name_id,
        an.name as author_name,
        'parenthetical author' as author_type,
        3 as author_order
    from
        taxon_name tn,
        agent_name an
    where tn.p_author_id = an.agent_name_id
    and an.name_type = 'kewabbr' and tn.p_author_id <> 5
    union
    select
        tn.taxon_id as legacy_id,
        an.agent_id as agent_id,
        tn.p_ascr_auth_id as author_name_id,
        an.name as author_name,
        'parenthetical ascribed author' as author_type,
        4 as author_order
    from
        taxon_name tn,
        agent_name an
    where tn.p_ascr_auth_id = an.agent_name_id
    and an.name_type = 'kewabbr' and tn.p_ascr_auth_id <> 5
    order by legacy_id, author_order
    

    Notes

    Talend/ETL

    MERGE: Taxon Common Names (repeating group) - Lam (SQL ready)

    Field mapping

    Need to get relationship type to and map to values in CSpace: Preferred, Secondary, Rejected

    SQL

    select
        ltr.related_lexicon_id as legacy_id,
        ltr.lexicon_id as common_name_id,
        l.term as common_name,
        'English' as language,
        case when ltr.relation_type = 'TCN' then 'preferred'
             when ltr.relation_type = 'TCNS' then 'secondary'
             when ltr.relation_type = 'TCNR' then 'rejected'
             end as common_name_type
    from
        lexicon_term_relation ltr,
        lexicon l
    where ltr.lexicon_id = l.lexicon_id
    and ltr.relation_type in ('TCN', 'TCNS', 'TCNR')
    

    Notes

    Talend/ETL

    MERGE: Taxon Attributes (repeating group) - Lam (SQL ready)

    E.g., dimensions, habitat, etc.

    Field mapping

    SQL

    select
        pa.lexicon_id as legacy_id,
        pa.date_entered as attribute_date,
        pa.agent_id as recorded_by_id,
        rb.name as recorded_by_name,
        pa.plant_height as height,
        pa.plant_width as width,
        pa.plant_dbh as dbh,
        pa.plant_habit as habitat,
        pa.climate_rating as climate_rating,
        ca.agent_id as conservation_agent_id,
        ca.name as conservation_agent_name,
        cc.conserv_cat as convservation_code,
        case
         when ca.name = 'California Native Plant Society' then 'CNPS'
         when ca.name = 'not applicable' then ''
         when ca.name = 'International Union for Conservation of Nature and Natural Resources - 1997' then 'IUCN-1997'
    --   when ca.name = 'Convention on International Trade in Endangered Species' then 'CNPS'
         when ca.name = 'Convention on International Trade in Endangered Species' then 'CITES'
         when ca.name = 'Rhododendron Species Foundation' then 'RSF'
    --   when ca.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'CNPS'
         when ca.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'IUCN-2006'
         when ca.name = 'South African National Biodiversity Institute' then 'SANBI'
         when ca.name = 'UNEP World Conservation Monitoring Center - CITES' then 'UNEP-CITES'
         when ca.name = 'New York Botanical Garden' then 'NYBG'
         when ca.name = 'US Fish & Wildlife Service' then 'USFWS'
         when ca.name = 'California Department of Fish & Game' then 'CADFG'
         when substring(cc.comp_key, 1, 5) = '60801' or ca.name = 'IUCN/SSC Global Tree Specialist Group, Red List of Rhododendrons - 2011' then 'IUCN-Rhododendrons-2011'
         when substring(cc.comp_key, 1, 5) = '60804' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2011.1' then 'IUCN-2011.1'
         when substring(cc.comp_key, 1, 5) = '60849' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2011.2' then 'IUCN-2011.2'
         when substring(cc.comp_key, 1, 5) = '60905' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2012.1' then 'IUCN-2012.1'
         when substring(cc.comp_key, 1, 5) = '60917' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2012.2' then 'IUCN-2012.2'
         when ca.name = 'Non-conservation publication' then 'Non-conservation pub'
         when cc.conserv_id in (38, 39, 41, 42, 47) then  ''
         end as conservCode_append,
        pa.frost_sensitive as frost_sensitive,
        pa.medicinal_flag as medicinal_use,
        pa.economic_flag as economic_use
    from
        plant_attr pa,
        agent_name rb,
        agent_name ca,
        conserv_code cc,
        agent_name_role anr
    where pa.agent_id = rb.agent_id
    and rb.name_type = 'pref'
    and pa.conserv_id = cc.conserv_id
    and cc.comp_key *= anr.comp_key
    and cc.role_type *= anr.role_type
    and cc.role_type = 'conserv'
    and anr.agent_id *= ca.agent_id
    and anr.agent_name_id *= ca.agent_name_id
    and anr.name_type *= ca.name_type
    and ca.name_type = 'pref'
    

    Notes

    Talend/ETL

    MERGE: Taxon Related Terms (repeating group) - Lam (SQL ready)

    Might be needed for relationships to synonyms, unverified terms, verified terms (?), to hybrid parent names (?), etc.

    Field mapping

    SQL

    select
        ltr.related_lexicon_id as legacy_id,
        ltr.lexicon_id as related_term_id,
        l.term as related_term,
        ltr.relation_type as relation_type
        'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
            + convert(varchar(10), ltr.lexicon_id) + ')''' + l.term + ''''
            as related_term_refname
    from
        lexicon_term_relation ltr,
        lexicon l
    where ltr.lexicon_id = l.lexicon_id
    and ltr.relation_type in ('FHP', 'MHP', 'SYN')
    and l.node_type != 'cv'
    union
    -- cv relations
    select
        ltr.related_lexicon_id as legacy_id,
        ltr.lexicon_id as related_term_id,
        c.term as related_term,
        ltr.relation_type as relation_type
        'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
            + convert(varchar(10), ltr.lexicon_id) + ')''' + c.term + ''''
            as related_term_refname
    from
        lexicon_term_relation ltr,
        cspace_lexicon_cv c
    where ltr.lexicon_id = c.lexicon_id
    and ltr.relation_type in ('FHP', 'MHP', 'SYN')
    order by legacy_id, relation_type;
    

    Notes

    Talend/ETL

    Taxonomic Hierarchy (hierarchical relations) - Lam (SQL ready)

    Based on parent taxa identified in taxon table

    Field mapping

    SQL

    select
        taxon_id as short_identifier,
        parent_id as parent_id
    from
        taxon_name tn,
        lexicon l
    where tn.taxon_id = l.lexicon_id
    and l.node_type not in ('common', 'bioname', 'na', 'root') and parent_id <> 0
    order by short_identifier;
    

    Notes

    Talend/ETL

    Common Taxonomic Names (common vocabulary in taxonomy authority) - Lam (SQL ready), Yuteh (Talend)

    Should get Access Code in this SQL too, right?

    Field mapping

    SQL

    -- this query includes 16 common name duplicates.
    -- duplicates are referenced in lexicon_term_relation.lexicon_id where relation_type in ('TCN', 'TCNS', 'TCNR')
    select
        l.lexicon_id as short_identifier,
        l.term  as display_name,
        case
                    when l.access_code = 0 then 'Unrestricted'
                    when l.access_code = 1 then 'Restricted'
                    when l.access_code = 2 then 'Dead'
        end  as access_restrictions
    from
        lexicon l
    where l.node_type = 'common';
    

    Notes

    exact duplicates (across term, access_code, valid_catalog_term_flag):

    select count(*), term
    from lexicon where node_type = 'common'
    group by term, access_code, valid_catalog_term_flag
    having count(*) > 1
    order by term

    2 Arnica
    2 California-Lilac
    2 crimson glory vine
    2 Hakeke, Mountain Holly
    2 Horoeka, Lancewood
    2 Horopito, Pepper Tree
    2 Lion's Spoor
    2 Monkeyflower
    2 Mountain Rimu, Pigmy pine
    2 Mountain totara, Snow totara
    2 Papapa, Snowberry
    2 Tanekaha, Celery pine
    2 Tarata, Lemonwood
    2 Wi Kura, Narrow-leaved Snow Tussock

    These two names are term duplicates (access_code or valid_catalog_term_flag are different)

    2 SECTION CHONIASTRUM
    2 ti kauka, Cabbage tree

    Here are the duplicates that are referenced in lexcon_term_relation for common names:

    select distinct l.lexicon_id, l.term
    from lexicon l, lexicon_term_relation ltr
    where l.lexicon_id = ltr.lexicon_id
    and relation_type in ('TCN', 'TCNS', 'TCNR')
    and l.term in (
            select term
            from lexicon
            where node_type = 'common'
            group by term
            having count(*) > 1)
    order by term, lexicon_id;

    40000115 Arnica
    40002353 Arnica
    40000552 California-Lilac
    40000588 California-Lilac
    40006165 Crimson Glory Vine
    40001587 Hakeke, Mountain Holly
    40001588 Hakeke, Mountain Holly
    40001687 Horoeka, Lancewood
    40001688 Horoeka, Lancewood
    40001690 Horopito, Pepper Tree
    40001691 Horopito, Pepper tree
    40024379 Lion's Spoor
    40024418 Lion's Spoor
    40002311 Monkeyflower
    40002314 Monkeyflower
    40002384 Mountain Rimu, Pigmy pine
    40002385 Mountain Rimu, Pigmy Pine
    40002391 Mountain totara, Snow totara
    40002392 Mountain totara, Snow totara
    40002615 Papapa, Snowberry
    40002616 Papapa, Snowberry
    40003741 Tanekaha, Celery pine
    40003742 Tanekaha, Celery pine
    40003751 Tarata, Lemonwood
    40003752 Tarata, Lemonwood
    40003809 Ti kauka, Cabbage tree
    40004151 Wi Kura, Narrow-leaved Snow Tussock
    40004152 Wi Kura, Narrow-leaved snow tussock

    Note: 40004151 and 40004152 are duplicates because of case insensitivity in Sybase db.

    Talend/ETL

    Plant Sales taxonomic vocabulary (plantsales vocabulary in taxonomy authority)

    We will be constructing a taxonomy vocabulary (third instance) called Plant Sales that will be used to support the Pot Tag procedure.  This will be made up of the unique taxonomic names in SAGE's volunteer_label table (using the formatted name field).

    Question: Will we also use family names from the volunteer_label table?

    Field Mapping

    ...

    SQL

    /*
    -- View to get smallest label_id as the short identifier for the formatted_taxon name.
    -- Manually excludes 7 sets of duplicates due to the taxon name being a valid genus name as well as a species name.
    -- The name in volunteer_label.formatted_taxon is the genus name, but due to the Sybase db being case insensitive, it creates a false match.
    
    create view cspace_formatted_taxon_id_rank as
    select
        min(vl.label_id) as label_id,
        vl.formatted_taxon,
        l.node_type,
        l.term
    from volunteer_label vl, lexicon l
    where vl.formatted_taxon *= l.term
    and l.node_type not in ('bioname', 'common', 'na', 'root')
    and l.lexicon_id not in (
        20300328, 20300815, 20302343, 20302378, 40022302, 20304254, 20304427)
    group by vl.formatted_taxon, l.node_type, l.term;
    
    */
    
    -- Query to get the plant sales taxon main record.
    -- There is only one term name, so no need for a second query for term names for xml merge.
    
    select distinct
        id_rank.label_id as short_identifier,
        vl.formatted_taxon as display_name,
        case
            when id_rank.node_type is not null then id_rank.node_type
            else
                case
                    when vl.formatted_taxon like '% cv. %' then 'cv'
                    when vl.formatted_taxon like '% var. %' then 'var'
                    when (vl.formatted_taxon like '% f. %'
                        and vl.formatted_taxon not like '% F. Ritter%'
                        and vl.formatted_taxon not like '% F. Muell.%'
                        and vl.formatted_taxon not like '% F. Dietr.%'
                        and vl.formatted_taxon not like '% F. Schmidt%'
                        and vl.formatted_taxon not like '% F. Phil.%'
                        and vl.formatted_taxon not like '% F. Schwarz%') then 'f'
                    when vl.formatted_taxon like '% subsp. %' then 'subsp'
                    else 'sp'
                end
        end as taxon_rank,
        vl.formatted_taxon as term_name,
        null as term_type,
        'accepted' as term_status,
        'valid' as taxonomic_status
    from
        volunteer_label vl,
        cspace_formatted_taxon_id_rank id_rank
    where vl.formatted_taxon *= id_rank.formatted_taxon
    and vl.formatted_taxon is not null;
    

    Notes

    ...

    Talend/ETL

    ...

    Garden Locations (Storage locations vocabulary) - SQL ready, Talend (Chris)

    Field mapping

    SQL

    SELECT
        'garden'||convert(varchar(5), garden_loc_id) shortidentifier,
        loc_type locationType,
        ltrim(full_loc_name) locationDisplayName,
        garden_loc_id legacy_id,
        parent_id,
        loc_hier termSourceNote,
        ltrim(loc_name) locationName,
        entered_staff_id,
        date_entered,
        last_change_staff_id,
        last_change_date
    FROM
        dbo.garden_location

    Notes

    • Location Types: bed, section, collection
    • Location Type format: <locationType>urn:cspace:botgarden.cspace.berkeley.edu:vocabularies:name(locationtype):item:name(drawer)'Drawer'</locationType>
    • Vocabulary CSID to persist: c2833911-4bca-49aa-a3f3
    • Refname format: <refName>urn:cspace:botgarden.cspace.berkeley.edu:locationauthorities:name(location):item:name(crhbed11350962164634)'crhbed1'</refName>

    Talend/ETL

    bgimportgardenlocation 0.2 in Chris's Talend botgarden repository

    Rename output file from ucbggardenlocation job to gardenlocout.txt!!

    Garden Locations Hierarchy (hierarchical relationships) - SQL ready, Talend (Chris)

    Field mapping

    SQL

    Notes

    Talend/ETL

    Rename output file from ucbggardenlocation job to gardenlocout.txt!!

    ucbglocrelations v0.1in Chris's Talend botgarden repository

    Job creates 571 relations, but one needs to be removed.  garden0/'none' has no broader record but has output in the file.  Remove by hand.  Looked like:

    <import service="Relations" type="Relation" CSID="1fde0aff-dbb0-4446-9745-eaa6df4039ae">
      <schema xmlns:relations_common="http://collectionspace.org/relation" name="relations_common">
        <relationshipType>hasBroader</relationshipType>
        <subjectDocumentType>Locationitem</subjectDocumentType>
        <subjectCsid>96fbf471-1f5f-4dab-8a25-516f0eab3e5b</subjectCsid>
        <subjectUri>/locationauthorities/dbf254e8-437d-4926-be5d/items/96fbf471-1f5f-4dab-8a25-516f0eab3e5b</subjectUri>
        <subjectRefName>urn:cspace:botgarden.cspace.berkeley.edu:locationauthorities:name(location):item:name(garden0)'none'</subjectRefName>
        <objectDocumentType>Locationitem</objectDocumentType>
        <objectCsid/>
        <objectUri/>
        <objectRefName/>
      </schema>
    </import>

    Place Authority - Lam (SQL), Yuteh (Talend)

    Field mapping

    SQL

    select
        gl.geog_lex_id as place_id,
        gl.geog_term as display_name,
        gtc.geog_type_name as place_type,
        gn.notes as place_note
    from
        geog_lexicon gl,
        geog_type_code gtc,
        geog_name gn
    where gl.geog_lex_id not in (
        select geog_lex_id
        from geog_term_relation
        where relation_type in ('SYN', 'RQN'))
    and gl.geog_type not in ('element', 'root')
    and gl.geog_type = gtc.geog_type
    and gl.geog_lex_id *= gn.geog_id;
    

    Notes

    Talend/ETL

    MERGE: Place terms (repeating group)

    Field mapping

    SQL

    -- geography term name repeating group
    -- verified geographic names
    select
        gl.geog_lex_id as place_id,
        gl.geog_term as display_name,
        gl.geog_term as term_name,
        'English' as term_language,
        gn.iso_country_cd as abbreviation,
        'Verified Geographic Name' as term_type,
        gn.data_src_id as term_source_id,
        an.name as term_source,
        gn.valid_term_flag as term_status,
        1 as preferred_flag
    from
        geog_lexicon gl,
        geog_name gn,
        agent_name an
    where gl.geog_lex_id = gn.geog_id
    and gn.data_src_id = an.agent_id
    and an.name_type = 'pref'
    union
    -- short names
    select
        gn.geog_id as place_id,
        gl.geog_term as display_name,
        gl.geog_term as term_name,
        'English' as term_language,
        gn.iso_country_cd as abbreviation,
        'Short Name' as term_type,
        gn.data_src_id as term_source_id,
        an.name as term_source,
        0 as term_status,
        0 as preferred_flag
    from
        geog_lexicon gl,
        geog_name gn,
        agent_name an
    where gl.geog_lex_id = gn.geog_element_id
    and gn.data_src_id = an.agent_id
    and an.name_type = 'pref'
    union
    -- reverse qualified names and synonyms                                         
    select
        gtr.related_geog_lex_id as place_id,
        gl.geog_term as display_name,
        gl.geog_term as term_name,
        'English' as term_language,
        null as abbreviation,
        case
            when gtr.relation_type = 'RQN' then 'Reverse Qualified Name'
            when gtr.relation_type = 'SYN' then 'Synonym'
            else null
        end as term_type,
        null as term_source_id,
        null as term_source,
        gl.valid_term_flag as term_status,
        0 as preferred_flag
    from
        geog_term_relation gtr,
        geog_lexicon gl,
        geog_lexicon rgl
    where gtr.geog_lex_id = gl.geog_lex_id
    and gtr.related_geog_lex_id = rgl.geog_lex_id
    and gtr.relation_type in ('RQN', 'SYN')
    and gl.geog_term != rgl.geog_term
    union
    -- remaining 71 geog terms that are not verified names, reverse qualified names, or synonyms
    select
        gl.geog_lex_id as place_id,
        gl.geog_term as display_name,
        gl.geog_term as term_name,
        'English' as term_language,
        null as abbreviation,
        'Unverified Geographic Name' as term_type,
        null as term_source_id,
        null as term_source,
        gl.valid_term_flag as term_status,
        1 as preferred_flag
    from
        geog_lexicon gl
    where gl.geog_lex_id not in (
        select geog_lex_id from geog_term_relation where relation_type in ('RQN', 'SYN'))
    and gl.geog_lex_id not in (
        select geog_id from geog_name)
    and gl.geog_type not in ('element', 'root')
    order by place_id, preferred_flag, display_name;
    

    Notes

    Talend/ETL

    Place hierarchy (hierarchical relationships)

    Field mapping

    SQL

    select
        gn.geog_id as place_id,
        case
            when gn.geog_parent_id = 0 then null
            else gn.geog_parent_id
        end as parent_id
    from
        geog_name gn,
        geog_lexicon gl
    where gl.geog_lex_id = gn.geog_id
    and gl.geog_type not in ('element', 'root');
    

    Notes

    Talend/ETL

    Research Projects (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

    Field mapping

    SQL

    SELECT
        project_id legacy_id,
        'researchproject'||convert(varchar(20), project_id) shortIdentifier,
        project_name conceptName,
        person_id conceptPerson,
        org_id conceptOrganization,
        agent_id createdBy,
        date_entered createdAt,
        last_change_date updatedAt,
        last_change_agent_id updatedBy
    FROM
        dbo.research_project

    Notes

    • refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(research_ca):item:name(research11350934474341)'research1'
    • vocabulary CSID to persist: deb53f97-9829-47b9-99cb

    Talend/ETL

    New Talend job by Yuteh: BG_concept_researchProj 0.1

    Classes (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

    Field mapping

    SQL

    SELECT
        course_id legacy_id,
        'class'||convert(varchar(20), course_id) shortIdentifier,
        course_title conceptName,
        ad.name||' '||course_num conceptTerm_coursenum,
        person_id conceptPerson,
        cu.org_id conceptOrganization_dept,
        ad.name deptName,
        inst_id institution_orgParent,
        ai.name conceptNote_instName,
        semester conceptQualifier_semester,
        cu.agent_id createdBy,
        cu.date_entered createdAt,
        cu.last_change_date updatedAt,
        cu.last_change_agent_id updatedBy
    FROM
        dbo.class_use cu
    left outer join agent_name ad on (cu.org_id = ad.agent_id)
    left outer join agent_name ai on (cu.inst_id = ai.agent_id)

    Notes

    • Institution (inst_id) is treated as the parent institution of the department (org_id).  However, this does not necessarily match what is recorded in the agent tables.  We will put the institution name in the scope notes field.
    • refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(class_ca):item:name(crhtestclass11350934263503)'crhtestclass1'
    • vocabulary CSID to persist: 96141839-4bac-4e3d-a511

    Talend/ETL

    New Talend job by Yuteh: BG_concept_class 0.1

    Propagation medium (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

    Field mapping

    SQL

    SELECT
        case when medium_description is null or medium_description = '' then medium_name
          else medium_description end as conceptDisplayName,
        medium_name conceptTerm,
        entered_staff_id,
        date_entered,
        last_change_staff_id,
        last_change_date
    FROM
        dbo.medium
    order by medium_description

    Notes

    • Construct the shortidentifier as 'medium' + an incrementing number, in Talend
    • Refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(concept):item:name(medium11350937075031)'medium1'
    • Vocabulary CSID to persist: bbe39bb2-a622-4570-a946

    Talend/ETL

    New Talend job by Yuteh: BG_concept_propgMedium 0.1

    Conservation Category (concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

    Field mapping

    SQL

    SELECT
        conserv_id,
        'conserv'||convert(varchar(5), conserv_id) shortIdentifier,
        comp_key,
        substring(comp_key, 1, 5) consagent_name_id,
        an.name agentname,
        rtrim(conserv_cat) conserv_code,
        case when an.name = 'California Native Plant Society' then 'CNPS'
         when an.name = 'not applicable' then ''
         when an.name = 'International Union for Conservation of Nature and Natural Resources - 1997' then 'IUCN-1997'
    --   when an.name = 'Convention on International Trade in Endangered Species' then 'CNPS'
         when an.name = 'Convention on International Trade in Endangered Species' then 'CITES'
         when an.name = 'Rhododendron Species Foundation' then 'RSF'
    --   when an.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'CNPS'
         when an.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'IUCN-2006'
         when an.name = 'South African National Biodiversity Institute' then 'SANBI'
         when an.name = 'UNEP World Conservation Monitoring Center - CITES' then 'UNEP-CITES'
         when an.name = 'New York Botanical Garden' then 'NYBG'
         when an.name = 'US Fish & Wildlife Service' then 'USFWS'
         when an.name = 'California Department of Fish & Game' then 'CADFG'
         when substring(comp_key, 1, 5) = '60801' or an.name = 'IUCN/SSC Global Tree Specialist Group, Red List of Rhododendrons - 2011' then 'IUCN-Rhododendrons-2011'
         when substring(comp_key, 1, 5) = '60804' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2011.1' then 'IUCN-2011.1'
         when substring(comp_key, 1, 5) = '60849' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2011.2' then 'IUCN-2011.2'
         when substring(comp_key, 1, 5) = '60905' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2012.1' then 'IUCN-2012.1'
         when substring(comp_key, 1, 5) = '60917' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2012.2' then 'IUCN-2012.2'
         when an.name = 'Non-conservation publication' then 'Non-conservation pub'
         when conserv_id in (38, 39, 41, 42, 47) then ''
         end as conservCode_append,
        cat_descr conserv_descr,
        case when red_dot = 1 then 'red dot on label'
          end as scope_note_red_dot
    FROM
        dbo.conserv_code c
    left outer join agent_name an on (substring(comp_key, 1, 5) = convert(varchar(20), an.agent_id) and an.name_type = 'pref')
    -- where conserv_id not in (38, 39, 41, 42, 47)
    order by comp_key

    Notes

    Conservation Categories are codes that are tied to Conservation Organizations.  Each organization can have its own list of conservation codes, and these can and do change over time.  Most have a code of 'rare'. 

    Need to figure out what display name will be.  Should the display name in the vocabulary be like: "EN IUCN-1997", "IUCN-1997: Endangered - meets criteria for endangered and is facing a high risk of extinction in the wild".  Think about how term completion will work.  Might need to derive some organization abbreviations for these display names.  Consider these "endangered" categories from different organizations (org, code, description):

    EN        	International Union for Conservation of Nature and Natural Resources - 1997	Taxa in danger of extinction and whose survival is unlikely if the causal factors continue operating
    EN        	International Union for Conservation of Nature and Natural Resources - 2006	(null)
    EN        	UNEP World Conservation Monitoring Center - CITES	(null)
    EN        	(null)	(null)
    EN        	New York Botanical Garden	(null)
    EN        	(null)	(null)
    EN        	IUCN/SSC Global Tree Specialist Group, Red List of Rhododendrons - 2011	Endangered - meets criteria for endangered and is facing a high risk of extinction in the wild
    EN        	International Union for Conservation of Nature and Natural Resources - 2011.1	endangered
    EN        	Non-conservation publication	endangered
    EN        	International Union for Conservation of Nature and Natural Resources - 2011.2	vulnerable
    EN        	International Union for Conservation of Nature and Natural Resources - 2012.1	endangered
    Endangered	US Fish & Wildlife Service	Endangered Species Act listing
    Endangered	California Department of Fish & Game	(null)

    Talend/ETL

    New Talend job by Yuteh: BG_concept_conservCatg 0.1

    Collection Objects

    Accessions - SQL ready (Chris), Talend dev (Chris), Data botgarden-dev loaded (Chris)

    Field mapping

    SQL

    SELECT
        rtrim(a.accession_number) objectnumber,
        convert(date, accession_date) accessionDate,
        case when dead_flag=1 then 'true' else 'false' end as dead_flg,
        convert(date, dead_date) deadDate,
        material_type briefDescMaterial,
        case
          when data_quality = 1 then 'Complete'
          when data_quality = 0 then 'Unknown'
        end as recordStatusQuality,
        material_source,
        case when requested_by <> 0 then requested_by end as requested_by,
        case when collector_id <> 0 then collector_id end as collector_id,
        rtrim(collector_number) collector_number,
        collection_date,
        case when begin_date > 0 then dateadd(dd,(begin_date - 2415021),
            convert(date, '1900-01-01', 103))
        end as begin_date_scalar,
        case when begin_date > 0 then datepart(year, dateadd(dd,(begin_date - 2415021),
            convert(datetime, '1900-01-01', 103)))
        end as begin_date_year,
        case when begin_date > 0 then datepart(month, dateadd(dd,(begin_date - 2415021),
            convert(datetime, '1900-01-01', 103)))
        end as begin_date_month,
        case when begin_date > 0 then datepart(day, dateadd(dd,(begin_date - 2415021),
            convert(datetime, '1900-01-01', 103)))
        end as begin_date_day,
        case when end_date > 0 then dateadd(dd,(end_date - 2415021),
            convert(date, '1900-01-01', 103))
        end as end_date_scalar,
        case when end_date > 0 then datepart(year, dateadd(dd,(end_date - 2415021),
            convert(datetime, '1900-01-01', 103)))
        end as end_date_year,
        case when end_date > 0 then datepart(month, dateadd(dd,(end_date - 2415021),
            convert(datetime, '1900-01-01', 103)))
        end as end_date_month,
        case when end_date > 0 then datepart(day, dateadd(dd,(end_date - 2415021),
            convert(datetime, '1900-01-01', 103)))
        end as end_date_day,
        case when an.notes is not null then accession_notes||'; '||an.notes else accession_notes end
     as accnNotes,
        case when plant_status='1' then 'true' else 'false' end as rare_flag,
        case when cbd_restrict_flag=1 then 'true' else 'false' end as cbd_restrict_flag,
        a.agent_id,
        a.date_entered,
        a.last_change_agent_id,
        a.last_change_date,
        case when l.access_code = 1 then 'no' else 'yes' end as posttopublic,
        cs.habitat_descr fieldCollectionNote,
        cs.veg_type vegetationType,
        flower_color,
        fruit_color,
        security_risk,
        case when (prov_type_cd = 'G' or cs.geog_lex_id in (2031,
      2326)) then 'true' end as cultivar_flag,
        case when prov_type_cd = 'W' then 'W-wild source'
         when prov_type_cd = 'Z' then 'Z-cultivated from wild source'
         when (prov_type_cd = 'G' or cs.geog_lex_id in (2031, 2326)) then 'G-not wild source'
         when prov_type_cd = 'U' then 'U-unknown'
         end as provenance_type,
        case when prop_hist_cd = 'VA' then 'vegetative reproduction'
         when prop_hist_cd = 'S' then 'sexual reproduction'
         end as propagation_history,
        case when class_use_flag = 1 then 'true' else 'false' end as class_use_flg,
        case when research_flag = 1 then 'true' else 'false' end as research_flg,
        case when distrib_flag = 1 then 'true' else 'false' end as distrib_flg,
        case when ignore_red_dot_flag = 1 then 'true' else 'false' end as ignore_red_dot_flg,
        case when fragrance_flag = 1  then 'true' when fragrance_flag=0 then 'false' end as
     fragrance_flg,
        case when breeding_system = 'M' then 'male'
         when breeding_system = 'F' then 'female'
         when breeding_system = 'B' then 'both'
         when breeding_system = 'Q' then 'dioecious and unknown sex'
         when breeding_system = 'U' then 'unknown'
         end as breeding_system,
        case when flower_jan = 0 then 'No'
         when flower_jan = 1 then 'Some'
         when flower_jan = 2 then 'Many'
         end as flowers_jan,
        case when flower_feb = 0 then 'No'
         when flower_feb = 1 then 'Some'
         when flower_feb = 2 then 'Many'
         end as flowers_feb,
        case when flower_mar = 0 then 'No'
         when flower_mar = 1 then 'Some'
         when flower_mar = 2 then 'Many'
         end as flowers_mar,
        case when flower_apr = 0 then 'No'
         when flower_apr = 1 then 'Some'
         when flower_apr = 2 then 'Many'
         end as flowers_apr,
        case when flower_may = 0 then 'No'
         when flower_may = 1 then 'Some'
         when flower_may = 2 then 'Many'
         end as flowers_may,
        case when flower_jun = 0 then 'No'
         when flower_jun = 1 then 'Some'
         when flower_jun = 2 then 'Many'
         end as flowers_jun,
        case when flower_jul = 0 then 'No'
         when flower_jul = 1 then 'Some'
         when flower_jul = 2 then 'Many'
         end as flowers_jul,
        case when flower_aug = 0 then 'No'
         when flower_aug = 1 then 'Some'
         when flower_aug = 2 then 'Many'
         end as flowers_aug,
        case when flower_sep = 0 then 'No'
         when flower_sep = 1 then 'Some'
         when flower_sep = 2 then 'Many'
         end as flowers_sep,
        case when flower_oct = 0 then 'No'
         when flower_oct = 1 then 'Some'
         when flower_oct = 2 then 'Many'
         end as flowers_oct,
        case when flower_nov = 0 then 'No'
         when flower_nov = 1 then 'Some'
         when flower_nov = 2 then 'Many'
         end as flowers_nov,
        case when flower_dec = 0 then 'No'
         when flower_dec = 1 then 'Some'
         when flower_dec = 2 then 'Many'
         end as flowers_dec,
        case when fruit_jan=1 then 'true' else 'false' end as fruit_jan_flg,
        case when fruit_feb=1 then 'true' else 'false' end as fruit_feb_flg,
        case when fruit_mar=1 then 'true' else 'false' end as fruit_mar_flg,
        case when fruit_apr=1 then 'true' else 'false' end as fruit_apr_flg,
        case when fruit_may=1 then 'true' else 'false' end as fruit_may_flg,
        case when fruit_jun=1 then 'true' else 'false' end as fruit_jun_flg,
        case when fruit_jul=1 then 'true' else 'false' end as fruit_jul_flg,
        case when fruit_aug=1 then 'true' else 'false' end as fruit_aug_flg,
        case when fruit_sep=1 then 'true' else 'false' end as fruit_sep_flg,
        case when fruit_oct=1 then 'true' else 'false' end as fruit_oct_flg,
        case when fruit_nov=1 then 'true' else 'false' end as fruit_nov_flg,
        case when fruit_dec=1 then 'true' else 'false' end as fruit_dec_flg
    FROM
        dbo.accession a
    left outer join accession_notes an on (a.accession_number = an.accession_number)
    left outer join collection_site cs on (a.accession_number = cs.accession_number)
    left outer join dbo.accession_attr aa on (a.accession_number = aa.accession_number)
    left outer join lexicon l on (a.lexicon_id = l.lexicon_id)
    where a.accession_number <> '00.0000'
    order by a.accession_number
    

    Notes

    • Exclude from requestedBy:
      • urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name(0)'Owner, Database'
    • Exclude from identBy: 
      • urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name(0)'Owner, Database'
      • urn:cspace:botgarden.cspace.berkeley.edu:null:item:name(null)'null'

    Talend/ETL

    ucbgaccessions 0.5 in Chris's Talend repository, botgarden

    Run Steps

    • Merge into accessions the fragments for:
      • determinations
      • collection sites
      • accession use
    • Change "schema2" and "schema3" to "schema"
    • Import

    MERGE: Taxonomic Determinations and Hybrid Parents (repeating group) - SQL ready to test (Chris), Talend dev-sample data ready (Chris)

    It will be easier to extract Hybrid Parents at the same time as determinations in this case. Talend can handle a single repeating group. Data will be denormalized in SQL via unions and then grouped by Talend. This is complicated by the fact that some identifications come from the accession table; others come from plant_identity.

    Field mapping

    SQL

    -- for all accessions: get plant_identity records for non-hybrids, one row per plant_identity record if no hybrid parent; multiple rows per accession, one with accepted_flag=1
    SELECT
        rtrim(accession_number) objectnumber,
        lexicon_id,
        case when accepted_flag = 0 then 'false'
          else 'true' end as accepted_flg,
        case when aff_lexicon_id = -5 then null
          else aff_lexicon_id end as aff_lexicon,
        'false' hybrid_flg,
        date_entered,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else id_made_date end as id_made_date,
        case when id_made_by > 0 then id_made_by end as id_made_by,
        id_reference,
        id_modifier,
        id_remarks,
        null parent_id,
        null parent_qualifier,
        null fhp_lexicon_id,
        null mhp_lexicon_id,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(year, id_made_date) end as id_year,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(month, id_made_date) end as id_month,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(day, id_made_date) end as id_day
    FROM
        dbo.plant_identity
    where fhp_lexicon_id = -5 and fhp_lexicon_id = -5 and accession_number <> '00.0000'
    union
    --  for all accessions: get plant_identity records for hybrids, usually two rows per plant_identity record if any hybrid parent; multiple rows/pairs per accession with one pair having accepted flag=1.
    SELECT
        rtrim(accession_number) objectnumber,
        case when lexicon_id = -5 then null
          else lexicon_id end as lexicon_id,
        case when accepted_flag = 0 then 'false'
          else 'true' end as accepted_flg,
        case when aff_lexicon_id = -5 then null
          else aff_lexicon_id end as aff_lexicon,
        'true' hybrid_flg,
        date_entered,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else id_made_date end as id_made_date,
        case when id_made_by > 0 then id_made_by end as id_made_by,
        id_reference,
        id_modifier,
        id_remarks,
        fhp_lexicon_id parent_id,
        'female' parent_qualifier,
        fhp_lexicon_id,
        mhp_lexicon_id,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(year, id_made_date) end as id_year,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(month, id_made_date) end as id_month,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(day, id_made_date) end as id_day
    FROM
        dbo.plant_identity
    where fhp_lexicon_id <> -5
    union
    -- get male hybrid parents from plant_identity
    SELECT
        rtrim(accession_number) objectnumber,
        case when lexicon_id = -5 then null
          else lexicon_id end as lexicon_id,
        case when accepted_flag = 0 then 'false'
          else 'true' end as accepted_flg,
        case when aff_lexicon_id = -5 then null
          else aff_lexicon_id end as aff_lexicon,
        'true' hybrid_flg,
        date_entered,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else id_made_date end as id_made_date,
        case when id_made_by > 0 then id_made_by end as id_made_by,
        id_reference,
        id_modifier,
        id_remarks,
        mhp_lexicon_id parent_id,
        'male' parent_qualifier,
        fhp_lexicon_id,
        mhp_lexicon_id,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(year, id_made_date) end as id_year,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(month, id_made_date) end as id_month,
        case when id_made_date = '3000-01-01 00:00:00' then null
          else datepart(day, id_made_date) end as id_day
    FROM
        dbo.plant_identity
    where mhp_lexicon_id <> -5
    union
    -- for the 414 where there is no accepted record in plant_identity, get the accession record for non-hybrids, one row per accession record; set accepted flag=1
    SELECT
        rtrim(accession_number) objectnumber,
        lexicon_id,
        'true' accepted_flg, -- assume these are accepted determinations though no record in plant_identity
        case when aff_lexicon_id = -5 then null
          else aff_lexicon_id end as aff_lexicon,
        case when hybrid_flag = 0 then 'false'
          else 'true' end as hybrid_flg,
        date_entered, -- this is accession.date_entered but probably is correct for this set
        null id_made_date,
        null id_made_by,
        null id_reference,
        null id_modifier,
        'determination from SAGE accession record' id_remarks,
        null parent_id,
        null parent_qualifier,
        null fhp_lexicon_id,
        null mhp_lexicon_id,
        null id_year,
        null id_month,
        null id_day
    FROM
        dbo.accession
    where fhp_lexicon_id = -5 and fhp_lexicon_id = -5
        and accession_number in (
        select a.accession_number
           from accession a
           left outer join plant_identity p on (a.accession_number = p.accession_number and p.accepted_flag =1)
           where p.accession_number is null
        )
    union
    -- for the 414 where there is no accepted record in plant_identity, get the accession record for hybrids, usually two rows per accession record; set accepted flag=1
    SELECT
        rtrim(accession_number) objectnumber,
        lexicon_id,
        'true' accepted_flg, -- assume these are accepted determinations though no record in plant_identity
        case when aff_lexicon_id = -5 then null
          else aff_lexicon_id end as aff_lexicon,
        case when hybrid_flag = 0 then 'false'
          else 'true' end as hybrid_flg,
        date_entered, -- this is accession.date_entered but probably is correct for this set
        null id_made_date,
        null id_made_by,
        null id_reference,
        null id_modifier,
        'determination from SAGE accession record' id_remarks,
        fhp_lexicon_id parent_id,
        'female' parent_qualifier,
        fhp_lexicon_id,
        mhp_lexicon_id,
        null id_year,
        null id_month,
        null id_day
    FROM
        dbo.accession
    where fhp_lexicon_id <> -5
        and accession_number in (
         select a.accession_number
           from accession a
           left outer join plant_identity p on (a.accession_number = p.accession_number and p.accepted_flag =1)
           where p.accession_number is null
    )
    union
    -- get male hybrid parents for the query above
    SELECT
        rtrim(accession_number) objectnumber,
        lexicon_id,
        'true' accepted_flg, -- assume these are accepted determinations though no record in plant_identity
        case when aff_lexicon_id = -5 then null
          else aff_lexicon_id end as aff_lexicon,
        case when hybrid_flag = 0 then 'false'
          else 'true' end as hybrid_flg,
        date_entered, -- this is accession.date_entered but probably is correct for this set
        null id_made_date,
        null id_made_by,
        null id_reference,
        null id_modifier,
        'determination from SAGE accession record' id_remarks,
        mhp_lexicon_id parent_id,
        'male' parent_qualifier,
        fhp_lexicon_id,
        mhp_lexicon_id,
        null id_year,
        null id_month,
        null id_day
    FROM
        dbo.accession
    where mhp_lexicon_id <> -5
        and accession_number in (
         select a.accession_number
           from accession a
           left outer join plant_identity p on (a.accession_number = p.accession_number and p.accepted_flag =1)
           where p.accession_number is null
    )
    order by objectnumber, accepted_flg desc, date_entered desc, parent_qualifier

    Notes

    So this is now determinations plus hybrids in one big denormalized set.  Psuedo-SQL:

    1 for all accessions (including the 418 that have no accepted record in plant_identity): get plant_identity records where FHP = -5 and MHP = -5 (one row per plant_identity record if no hybrid parent; multiple rows per accession, usually one with accepted_flag=1; order by accession_number, accepted_flag desc)
    union
    2 for all accessions (including the 418 that have no accepted record in plant_identity): get plant_identity records where FHP <> -5 or MHP <> -5 (usually two rows per plant_identity record if any hybrid parent; multiple rows/pairs per accession with one pair having accepted flag =1.  Order by accession_number, accepted_flag desc, parent_type female before male)
    union
    3. for the 418 accessions that have no accepted record in plant_identity, get the accession record where FHP = -5 and MHP = -5 (one row per accession record if no hybrid parent; set accepted flag=1? or not; maybe indicate this is a record from accession table; no accepted record in plant_identity)
    union
    4. for the 418 accessions that have no accepted record in plant_identity, get the accession records where FHP <> -5 or MHP <> -5 (usually two rows per accession record if any hybrid parent; order by accession_number, parent_type female before male; set accepted flag=1)

    There are 12 non-preferred terms used throughout this set of 51K determinations. These need a different refname than is stored in the database. Rather than code that into the Talend job, I will just edit the merge file by hand to edit in the non-preferred term's refname.

    Talend/ETL

    Chris ucbgdeterminations 0.5 in Chris's Talend repository botgarden

    MERGE: Collection Sites (not repeating but easier to extract separately) - SQL dev (Chris), Talend dev-sample data ready (Chris)

    Including significant cleanup of geography data from SAGE.

    Field mapping

    These are the rules we are using for cleaning up the geographic data and collection sites in SAGE:

    For accessions from years 2000 and later:
    • If Collection Site = 'Earth' or Hort, and Accession Notes has parentheses
      • Populate the new Taxonomic Range field (in Collection Object, Locality Info) with the Accession Notes info in parentheses (Taxonomic Range data come from accession table, parenthetical text in accession.accession_notes)
      • Field collection place is blank
      • Field collection county, Field collection state, Field collection country are blank
    • If Collection Site = some known non-Earth, non-Hort location
      • Taxonomic range information is blank
      • Field collection place gets the Collection Site value (which will be controlled by the Place Authority) (Field Collection Place data come from collection_site table, via geog_lexicon.geog_term)
      • Field collection county, state, and country will be blank in the cataloging record but will be available via the Place Authority by importing hierarchical Place Authority records.
    For accessions from years 1999 and earlier:
    • If any of the county, state, country fields has parentheses, use data from
      • Taxonomic Range data come from Original Accession Geography, the accession_loc table (accession_loc.country_id, accession_loc.state_id, accession_loc.county_id). Strip parentheses, concatenate, and put into the new Taxonomic range field.
      • Field collection place is blank
      • Field collection county, Field collection state, Field collection country are blank
    • If all three do not have parentheses, and collection site is not Earth or Hort., use data from Collection Site
      • Taxonomic range field is blank
      • Field Collection Place data come from collection_site table (geog_lexicon.geog_term) Collection Site
      • Field collection county, state, and country will be blank in the cataloging record but will be available via the Place Authority by importing hierarchical Place Authority records.
    • If all three do not have parentheses; collection site = Earth or Hort.; and there are some legitimate non-range values in country, state, or county
      • Taxonomic range field is blank
      • Field collection place is blank
      • Field collection county, Field collection state, Field collection country get the values from Original Accession Geography (accession_loc)

    Note: If collection site = 'hort.' or if country = 'HORT.' then the cultivated checkbox should be checked, and provenance drop down should be "not wild source".

    SQL

    SELECT
        rtrim(cs.accession_number) objectnumber,
        case when lat_deg is not null then
         convert(varchar,convert(int, lat_deg))||' '||convert(varchar,convert(int,
      lat_min))||' '||convert(varchar,convert(int, lat_sec))||lat_dir
        end as vlatitude,
        case when long_deg is not null then
          convert(varchar,convert(int, long_deg))||' '||convert(varchar,convert(int,
      long_min))||' '||convert(varchar,convert(int, long_sec))||long_dir
        end as vlongitude,
        latitude declatitude,
        longitude declongitude,
        case when UTM_north is not null then
          convert(varchar,UTM_east)||' mE, '||convert(varchar,UTM_north)||' mN'
        end as vcoords,
        elevation velevation,
        lower_elev minelevation,
        upper_elev maxelevation,
        case when lower_elev is not null then 'meters' end as unitselevation,
        case
          when geog_detail is not null and geog_distance is null then geog_detail
          when geog_detail is null and geog_distance is not null then geog_distance
          when geog_detail is not null and geog_distance is not null then geog_detail||';
     '||geog_distance
        end as vlocation,
        habitat_descr,
        veg_type,
        use_range_flag,
        case
          when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
     '%(%)%' and cs.geog_lex_id <> 0) then cs.geog_lex_id
          when (cs.geog_lex_id in (2031, 2326)) then 2326
        end as geog_lex,
        case when gl.geog_term = 'hort' then 'hort.'
           else gl.geog_term
        end as geog_term,
        case
          when (country_name like '%(%' or state_name like '%(%)%' or county_name like '%(%)%')
            then rtrim(county_name||', '||state_name||', '||country_name)
        end as taxonomic_range,
        case
          when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
     '%(%)%' and cs.geog_lex_id in (0) and (al.country_id not in (0, 442,
      562) or al.state_id <> 0 or al.county_id <> 0)) then country_name
        end as new_country,
        case
          when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
     '%(%)%' and cs.geog_lex_id in (0) and (al.country_id not in (0, 442,
      562) or al.state_id <> 0 or al.county_id <> 0) and al.state_id <> 0) then state_name
        end as new_state,
        case
          when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
     '%(%)%' and (cs.geog_lex_id < 5953) and (al.country_id not in (0, 442,
      562) or al.state_id <> 0 or al.county_id <> 0) and al.county_id not in (0, 14) ) then county_name
        end as new_county
    FROM
        collection_site cs
    left outer join dbo.accession_loc al on (al.accession_number = cs.accession_number)
    left outer join country_code cc on (al.country_id = cc.country_id)
    left outer join state_code sc on (al.state_id = sc.state_id)
    left outer join county_code c on (al.county_id = c.county_id)
    left outer join geog_lexicon gl on (cs.geog_lex_id = gl.geog_lex_id)
    where substring(cs.accession_number, 3, 1) = '.' and cs.accession_number <> '00.0000'
    union
    SELECT
        rtrim(a.accession_number) objectnumber,
        case when lat_deg is not null then
         convert(varchar,convert(int, lat_deg))||' '||convert(varchar,convert(int,
      lat_min))||' '||convert(varchar,convert(int, lat_sec))||lat_dir
        end as vlatitude,
        case when long_deg is not null then
          convert(varchar,convert(int, long_deg))||' '||convert(varchar,convert(int,
      long_min))||' '||convert(varchar,convert(int, long_sec))||long_dir
        end as vlongitude,
        latitude declatitude,
        longitude declongitude,
        case when UTM_north is not null then
          convert(varchar,UTM_east)||' mE, '||convert(varchar,UTM_north)||' mN'
        end as vcoords,
        elevation velevation,
        lower_elev minelevation,
        upper_elev maxelevation,
        case when lower_elev is not null then 'meters' end as unitselevation,
        case
          when geog_detail is not null and geog_distance is null then geog_detail
          when geog_detail is null and geog_distance is not null then geog_distance
          when geog_detail is not null and geog_distance is not null then geog_detail||';
     '||geog_distance
        end as vlocation,
        habitat_descr,
        veg_type,
        use_range_flag,
        case
          when (cs.geog_lex_id not in (0, 2031) ) then cs.geog_lex_id
          when (cs.geog_lex_id = 2031 ) then 2326
        end as geog_lex,
        case when gl.geog_term = 'hort' then 'hort.'
           else gl.geog_term
        end as geog_term,
        case
          when (cs.geog_lex_id in (0, 2031, 2326) and accession_notes like '%(%)%')
            then substring(
            substring(
                    a.accession_notes,
                    (patindex('%(%)%', a.accession_notes)) + 1,
                    char_length(a.accession_notes)),
            1,
            (patindex('%)%', substring(
                    a.accession_notes,
                    (patindex('%(%)%', a.accession_notes)) + 1,
                    char_length(a.accession_notes)))) - 1
        )
        end as taxonomic_range,
        case
          when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
     '%(%)%' and cs.geog_lex_id =0 and (al.country_id not in (0, 442,
      562) or al.state_id <> 0 or al.county_id <> 0)) then country_name
        end as new_country,
        case
          when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
     '%(%)%' and cs.geog_lex_id =0 and (al.country_id not in (0, 442,
      562) or al.state_id <> 0 or al.county_id <> 0) and al.state_id <> 0) then state_name
        end as new_state,
        case
          when (county_name not like '%(%)%' and county_name is not null and al.county_id <> 0 and
     cs.geog_lex_id < 5953) then county_name
        end as new_county
    FROM
        dbo.accession a
    left outer join accession_loc al on (a.accession_number = al.accession_number)
    left outer join collection_site cs on (a.accession_number = cs.accession_number)
    left outer join geog_lexicon gl on (cs.geog_lex_id = gl.geog_lex_id)
    left outer join country_code cc on (al.country_id = cc.country_id)
    left outer join state_code sc on (al.state_id = sc.state_id)
    left outer join county_code c on (al.county_id = c.county_id)
    where substring(a.accession_number, 5, 1) = '.'
    order by objectnumber
    

    Notes

    Sites can have hort and range..

    County, state, country only used in a small handful of cases where they provided useful information not in collection site or taxonomic range.  Concatenated onto verbatim locality field.

    Talend/ETL

    ucbgcollectionsite 0.6 in Chris's Talend repository botgarden

    MERGE: Use of Accessions (repeating group) - SQL ready (Chris), Talend dev-sample data ready (Chris)

    Research projects and classes

    Field mapping

    SQL

    SELECT
        rtrim(accession_number) accession_number,
        'research' usetype,
        'researchproject'||convert(varchar(20), rb.project_id) conceptShortID,
        rp.project_name conceptName,
        request_date,
        filled_date,
        material_type,
        notes
    FROM
        dbo.research_plant rb
    join research_project rp on (rp.project_id = rb.project_id)
    union
    SELECT
        rtrim(accession_number) accession_number,
        'class' usetype,
        'class'||convert(varchar(20), cp.course_id) conceptShortID,
        cu.course_title conceptName,
        cp.last_change_date request_date,
        null filled_date,
        null material_type,
        concept notes
    FROM
        dbo.class_plant cp
    join class_use cu on (cp.course_id=cu.course_id)
    order by accession_number, request_date

    Notes

    URN format for materialType: <materialType>urn:cspace:botgarden.cspace.berkeley.edu:vocabularies:name(materialtype):item:name(bulb)'bulb'</materialType>

    Talend/ETL

    ucbgaccessionuse 0.1 in Chris's Talend repository botgarden

    Procedures

    Plant Locations (LMI procedure, including current location and history/version) and Relationships to Objects - SQL ready (Chris), Talend (Yuteh)

    See attached accessions map, current_location and accession_history tables

    Field mapping

    SQL

    Initially, this SQL just gets the current location records, plant_location table.

    SELECT
        null record_id,
        rtrim(accession_number) accession_number,
        'plant_location' datasource,
        action_date,
        case when action_code = 0 then 'Dead'
         when action_code = 1 then 'Planted Out'
         when action_code = 2 then 'Moved'
         when action_code = 3 then 'Label Request'
         when action_code = 4 then 'Theft'
         when action_code = 5 then 'Other'
         when action_code = 6 then 'Revived'
        end as reasonForMove,
        null previous_location,
        'garden'||convert(varchar(5), garden_loc_id) current_location,
        case when label_req_flag = 1 then 'true' else 'false' end as label_request,
        label_size,
        stand_type,
        label_count,
        comments,
        entered_staff_id,
        date_entered,
        last_change_staff_id,
        last_change_date
    FROM
        dbo.plant_location
    where accession_number <> '00.0000'
    order by accession_number, datasource desc, action_date desc

    Notes

    Accession History records are being loaded as versioned LMI records.  We still need to know how to do that.

    Talend/ETL

    Uses output from ucbgaccessions Talend job that contains CSIDs and Object Numbers. Rename most recent output file to ucbg-accessions-ids.txt

    Also uses a file sage_createupdateby_lookup.txt that contains a few user email addresses to decode the staff ID for the createdBy and updateBy info

    ucbgplantlocations 0.1

    Run steps

    • Movements: Change "schema2" to "schema"
    • Movements: Import
    • Move-to-object relationships: Import
    • Object-to-move relationships: Import

    Vouchers and Voucher-Object relationships - SQL ready (Chris), Talend dev (Lam)

    Field mapping

    Using Loan Out for Voucher

    SQL

    SELECT
        voucher_id, -- loanOutNumber
        accession_number, -- for relationship to Accession
        voucher_number voucherInstitution, -- get refname for borrower
        case when h_w_code='h' then 'Horticultural'
             when h_w_code='w' then 'Wild'
        end as hortWild, -- decode to h Horticultural or w Wild
        vouchered_date,
        case when voucher_early_date>2415021 and voucher_early_date<2556233 and vouchered_date not like ('%.%')
              then dateadd(dd,(voucher_early_date - 2415021), convert(date, '1900-01-01', 103))
        end as voucherDate, -- loanOutDate
        voucher_early_date, -- ??
        voucher_late_date, -- ??
        case when vouchered_by <> 0 then vouchered_by end as vouchered_by,
        case when garden_loc_id <> 0 then garden_loc_id end as garden_loc_id, -- get refname for garden loc
        label_count numSheets, -- numLent
        smasch_flag, -- skip
        case when flowering_flag=1 then 'true' else 'false' end as flowering, -- case 1=true
        case when fruiting_flag=1 then 'true' else 'false' end as fruiting, -- case 1=true
        case when sterile_flag=1 then 'true' else 'false' end as sterile, -- case 1=true
        entered_staff_id createdBy,
        date_entered createdAt,
        last_change_staff_id updatedBy,
        last_change_date updatedAt
    FROM
        dbo.voucher

    Notes

    Still need to get UCJEPS accession number for some number of vouchers

    • Uses ID output from job that creates garden locations vocabulary.  Make sure file is renamed to ucbg-gardenloc-ids.txt
    • Uses ID output from job that creates accessions.  Make sure that file is renamed to ucbg-accessions-ids.txt

    Talend/ETL

    ucbgvoucher 0.1

    Run steps

    • Vouchers: Change "schema2" and "schema3" to "schema"
    • Vouchers: Import
    • Voucher-to-object relationships: Import
    • Object-to-voucher relationships: Import

    Propagation  SQL ready (Chris), REST Import development (John)

    Field mapping

    SQL

    Control lists:

    • Pot Type: select pot_type from pot_type_code order by pot_type;
    • Pot Size: select pot_size from pot_size_code order by pot_size;
    • Treatment Type: select treatment_type_name from treatment_type order by treatment_type_name;

    Notes

    See Botanical propagation data analysis

    Prop Type:

    • lookup table: prop_type_code
    • propagation.prop_type = pro_type_code.prop_type
    • 21 values in prop_type_code

    Reason for Prop: propagation.purpose

    • database has a rule that limits the values to: 'class use', 'research', 'garden collection',
      'distribution', 'conservation'
    • 5 distinct values in propagation.purpose

    Extra Seeds: propagation.extra_seed_flag

    • boolean (0,1)

    Activity Type:

    • lookup table: activity_type_code
    • activity_type_code.activity_type where growth_history.activity_type_id = activity_type_code.activity_type_id
    • 24 values in activity_type_code

    Medium:

    • lookup table: medium
    • growth_history.medium_name = medium.medium_name
    • 1302 values in medium

    Pot Type:d

    • lookup table: pot_type
    • growth_history.pot_type = pot_type_code.pot_type
    • 35 values in pot_type_code

    Pot Size:

    • lookup table: pot_size_code
    • pot_size_code.pot_size where growth_history.pot_size_id = pot_size_code.pot_size_id
    • 69 values in pot_size_code

    Treatment Type:

    • lookup table: treatment_type
    • treatment_type.treatment_type_name where treatment.treatment_type_id = treatment_type.treatment_type_id
    • 91 values in treatment_type

    Concentration: treatment.concentration

    • varchar(10) field
    • no rules or constraints on this field
    • 80 distinct values in treatment.concentration

    Talend/ETL

    MERGE: Propagation Activities (repeating group) SQL ready (Chris), REST import (John)

    Field mapping

    SQL

    Notes

    See Botanical propagation data analysis

    Relationships between Accessions and Propagations; REST import (John)

    Field mapping

    SQL

    Notes

    Talend/ETL

    Talend/ETL

    Distributions and Distribution-Accession relationships (using Object Exit procedure) - SQL dev (Chris), Talend dev (Chris), Data botgarden-dev loaded (Chris)

    Field mapping

    SQL

    SELECT
        rtrim(accession_number) accession_number, -- for relations
        e.org_id, -- Exchange Org, get refname
        o.org_name,
        convert(date, exchange_date) exchangeDate, -- for display date
        exchange_date, -- for scalar dates
        datepart(year, exchange_date) as exchange_date_year,
        datepart(month, exchange_date) as exchange_date_month,
        datepart(day, exchange_date) as exchange_date_day,
        plant_count,
        e.entered_staff_id,
        e.date_entered,
        exchange_comments
    FROM
        dbo.exchange e,
        organization o
        where e.org_id=o.org_id

    Notes

    • Uses ID output from job that creates accessions.  Make sure that file is renamed to ucbg-accessions-ids.txt

    Talend/ETL

    ucbgdistributions 0.1

    Run steps

    • Distributions: Change "schema2" to "schema"
    • Distributions: Import
    • Distribution-to-object relationships: Import
    • Object-to-distribution relationships: Import

    Pot Tags (Volunteer Labels) - Lam (SQL), Chris (Talend)

    Field mapping

    SQL

    select
        vl.label_id as legacyId,
        vl.accession_number as accessionNumber,
        vl.formatted_taxon as taxonName,
        vl.common_name as commonName,
        vl.label_data as labelData,
        vl.family as family,
        vl.country_name as locale,
        vl.quantity as numberOfLabels,
        case when vl.label_req_flag = 0 then 'no'
             when vl.label_req_flag = 1 then 'yes'
        end as printLabels
    from
        volunteer_label vl
    

    Notes

    Talend/ETL

    ucbgpottags 0.1 creates pot tag records and relationship records to collection objects

    • No labels