UCBG-CollectionSpace data mapping, v2.4

With this iteration, we are targeting the data migration mappings for the migration of SAGE data into a CollectionSpace instance, based on the version 2.4 release of CollectionSpace. This launch is intended to support the full migration of the UC Botanical Garden SAGE system into a production CollectionSpace system. Changes from the previous mapping are in blue.

	File	Modified

You are not logged in. Any changes you make will be marked as anonymous. You may want to Log In if you already have an account.

No files shared here yet.

Drag and drop to upload or browse for files

Authorities

Person (default person vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

See attached agents data map.

Tentatively, we will have one person vocabulary, mapping the agents from SAGE that are also in the persons table.

There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.

Field mapping

SQL

select
    p.person_id as short_identifier,
    an.name as display_name,
    p.birth_date as birth_date,
    p.death_date as death_date,
    datepart(year, p.birth_date) as birth_date_year,
    datepart(month, p.birth_date) as birth_date_month,
    datepart(day, p.birth_date) as birth_date_day,
    case when p.birth_date is null then null
        else str_replace(convert(char(10), p.birth_date, 102), '.', '-')
            + 'T00:00:00Z'
    end as birth_date_Esclr,
    case when p.birth_date is null then null
        else str_replace(convert(char(10), p.birth_date, 102), '.', '-')
            + 'T23:59:59Z'
    end as birth_date_Lsclr,
    datepart(year, p.death_date) as death_date_year,
    datepart(month, p.death_date) as death_date_month,
    datepart(day, p.death_date) as death_date_day,
    case when p.death_date is null then null
        else str_replace(convert(char(10), p.death_date, 102), '.', '-')
            + 'T00:00:00Z'
    end as death_date_Esclr,
    case when p.death_date is null then null
        else str_replace(convert(char(10), p.death_date, 102), '.', '-')
            + 'T23:59:59Z'
    end as death_date_Lsclr,
    p.person_remark as name_note
from
    person p,
    agent_name an
where p.person_id = an.agent_id
and an.name_type = 'pref'
and p.person_id > 0

Notes

Talend/ETL

bgimportperson1 0.2: Revising earlier talend job (based on 1.8) to change refname format, remove ampersand handling, and include collectionspace_core:uri element

New Talend job by Yuteh: BG_person 0.1

MERGE: Person Terms (repeating group) - Lam (SQL ready), Talend job (yuteh)

E.g., Different name forms (preferred, alias)

Field mapping

SQL

-- kewabbr as preferred names when they are duplicates of preferred name
select
    an.agent_id as short_identifier,
    an.name as display_name,
    an.name as term_name,
    an.name_type as term_type,
    'accepted' as term_status,
    'English' as language,
    an_source.name as source,
    p.data_source_id as source_id,
    p.prefix as title,
    p.first_name as forename,
    p.middle_name as middle_name,
    p.last_name as surname,
    p.suffix as name_addition,
    case
        when p.first_name is null and p.middle_name is null
            then substring(p.last_name, 1, 1)
        when p.first_name is not null and p.middle_name is null
            then substring(p.first_name, 1, 1) + substring(p.last_name, 1, 1)
        when p.first_name is null and p.middle_name is not null
            then substring(p.middle_name, 1, 1) + substring(p.last_name, 1, 1)
        else substring(p.first_name, 1, 1) +
            substring(p.middle_name, 1, 1) +
            substring(p.last_name, 1, 1)
    end as initials,
    1 as preferred_fg
from
    agent_name an,
    agent_name an2,
    person p,
    agent_name an_source,
    agent_name_role anr
where p.person_id = an.agent_id
and an.agent_id = an2.agent_id
and an2.name_type = 'pref'
and an.name_type = 'kewabbr'
and an.name = an2.name
and p.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type = 'pref'
and anr.role_type = 'datasrc'
and p.person_id > 0
union
-- preferred names except those that are duplicates with a kewabbr
select
    an.agent_id as short_identifier,
    an.name as display_name,
    an.name as term_name,
    an.name_type as term_type,
    'accepted' as term_status,
    'English' as language,
    an_source.name as source,
    p.data_source_id as source_id,
    p.prefix as title,
    p.first_name as forename,
    p.middle_name as middle_name,
    p.last_name as surname,
    p.suffix as name_addition,
    case
        when p.first_name is null and p.middle_name is null
            then substring(p.last_name, 1, 1)
        when p.first_name is not null and p.middle_name is null
            then substring(p.first_name, 1, 1) + substring(p.last_name, 1, 1)
        when p.first_name is null and p.middle_name is not null
            then substring(p.middle_name, 1, 1) + substring(p.last_name, 1, 1)
        else substring(p.first_name, 1, 1) +
            substring(p.middle_name, 1, 1) +
            substring(p.last_name, 1, 1)
    end as initials,
    1 as preferred_fg
from
    agent_name an,
    person p,
    agent_name an_source,
    agent_name_role anr
where p.person_id = an.agent_id
and an.name_type = 'pref'
and p.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type = 'pref'
and anr.role_type = 'datasrc'
and p.person_id > 0
and an.agent_name_id not in (
    select an_pref.agent_name_id
    from person p, agent_name an_pref, agent_name an_kew
    where p.person_id = an_pref.agent_id
    and an_pref.agent_id = an_kew.agent_id
    and an_pref.name_type = 'pref'
    and an_kew.name_type = 'kewabbr'
    and an_pref.name = an_kew.name)
union
-- alternate names that are not duplicates of pref name
select
    an.agent_id as short_identifier,
    an.name as display_name,
    an.name as term_name,
    an.name_type as term_type,
    'accepted' as term_status,
    'English' as language,
    an_source.name as source,
    p.data_source_id as source_id,
    p.prefix as title,
    p.first_name as forename,
    p.middle_name as middle_name,
    p.last_name as surname,
    p.suffix as name_addition,
    case
        when p.first_name is null and p.middle_name is null
            then substring(p.last_name, 1, 1)
        when p.first_name is not null and p.middle_name is null
            then substring(p.first_name, 1, 1) + substring(p.last_name, 1, 1)
        when p.first_name is null and p.middle_name is not null
            then substring(p.middle_name, 1, 1) + substring(p.last_name, 1, 1)
        else substring(p.first_name, 1, 1) +
            substring(p.middle_name, 1, 1) +
            substring(p.last_name, 1, 1)
    end as initials,
    0 as preferred_fg
from
    agent_name an,
    agent_name an2,
    person p,
    agent_name an_source,
    agent_name_role anr
where p.person_id = an.agent_id
and an.agent_id = an2.agent_id
and an2.name_type = 'pref'
and an.name_type != 'pref'
and an.name != an2.name
and p.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type = 'pref'
and anr.role_type = 'datasrc'
and p.person_id > 0
order by short_identifier, preferred_fg desc;

Notes

Talend/ETL

New Talend job by Yuteh: BG_personTerm 0.1

Organization (default Org vocabulary will be UCBG institutions) - Lam (SQL ready), Talend dev (yuteh)

See attached agents data map

We will have the following three organization vocabularies:

organizations (institutions from SAGE organizations plus the voucher institutions)
collectors (coming from the separate SAGE table)
groups (mostly KEW names for nomenclature)

Field mapping

SQL

select
   convert(varchar(5), org_id) as short_identifier,
   org_name as display_name,
   org_remark as notes
from organization
union
select distinct
    voucher_number as short_identifier,
    case when voucher_name is null then voucher_number||' (voucher institution)'
      else voucher_name end as display_name,
    null as notes
from voucher_number_code

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_org 0.1

MERGE: Organization Terms (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

select
    convert(varchar(5), o.org_id) as short_identifier,
    an.name as display_name,
    an.name_type as term_type,
    'accepted' as term_status,
    an.name as term_name,
    'English' as term_language,
    an_source.name as term_source,
    an_source.agent_id as term_source_id,
    an.name as main_body_name,
    o.dept_unit as additions_to_name,
    1 as preferred_fg
from
    organization o,
    agent_name an,
    agent_name an_source,
    agent_name_role anr
where o.org_id = an.agent_id
and an.name_type = 'pref'
and o.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type = 'pref'
and anr.role_type = 'datasrc'
union
select
    convert(varchar(5), o.org_id) as short_identifier,
    an.name as display_name,
    an.name_type as term_type,
    'accepted' as term_status,
    an.name as term_name,
    'English' as term_language,
    an_source.name as term_source,
    an_source.agent_id as term_source_id,
    an.name as main_body_name,
    o.dept_unit as additions_to_name,
    0 as preferred_fg
from
    organization o,
    agent_name an,
    agent_name an2,
    agent_name an_source,
    agent_name_role anr
where o.org_id = an.agent_id
and an.agent_id = an2.agent_id
and an.name_type != 'pref'
and an2.name_type = 'pref'
and an.name != an2.name
and o.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type = 'pref'
and anr.role_type = 'datasrc'
union
select
    voucher_number as short_identifier,
    case when voucher_name is null then voucher_number||' (voucher institution)'
      else voucher_name end as display_name,
    'pref' as term_type,
    'accepted' as term_status,
    case when voucher_name is null then voucher_number||' (voucher institution)'
      else voucher_name end as term_name,
    'English' as term_language,
    null as term_source,
    null as term_source_id,
    case when voucher_name is null then voucher_number||' (voucher institution)'
      else voucher_name end as main_body_name,
    null as additions_to_name,
    1 as preferred_fg
from voucher_number_code
order by short_identifier, preferred_fg desc

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_orgTerm 0.1

MERGE: Organization Associated Persons (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

-- need to verify person_id = 0 records.
select
    op.org_id as legacy_org_id,
    op.person_id as legacy_affperson_id,
    an.name as affiliated_person,
    'urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name('
    + convert(varchar(10), op.person_id) + ')'''+ an.name + ''''
    as affiliated_person_refname,
    'member' as affiliated_person_type
from
    org_person op,
    agent_name an
where op.person_id = an.agent_id
and an.name_type = 'pref'

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_orgAssocPsn 0.1

Collectors (Organization vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

Note: Only one name in SAGE so no repeating groups

Field mapping

SQL

select
    cc.collector_id as short_identifier,
    cc.collector_name as display_name,
    cc.collector_name as term_name,
    'English' as term_language,
    'pref' as term_type,
    'accepted' as term_status,
    cc.collector_name as main_body_name,
    1 as preferred_fg
from collector_code cc
where collector_id != 1
order by cc.collector_name

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_collector 0.1

Groups (Organization Vocabulary) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

select
    group_id as short_identifier,
    group_name as display_name,
    group_remark as notes
from
    group_master
order by group_id

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_group 0.1

MERGE: Groups Terms (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

-- kewabbr group names that are duplicates of pref group names
select
    gm.group_id as short_identifier,
    an.name as display_name,
    an.name_type as term_type,
    'accepted' as term_status,
    an.name as term_name,
    'English' as term_language,
    an.name as term_source,
    an.agent_id as term_source_id,
    an.name as main_body_name,
    1 as preferred_fg
from
    group_master gm,
    agent_name an,
    agent_name an2,
    agent_name an_source,
    agent_name_role anr
where gm.group_id = an.agent_id
and an.agent_id = an2.agent_id
and an.name_type = 'kewabbr'
and an2.name_type = 'pref'
and an.name = an2.name
and gm.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type= 'pref'
and anr.role_type = 'datasrc'
union
-- pref group names except for pref names that are duplicates of kewabbr
select
    gm.group_id as short_identifier,
    an.name as display_name,
    an.name_type as term_type,
    'accepted' as term_status,
    an.name as term_name,
    'English' as term_language,
    an.name as term_source,
    an.agent_id as term_source_id,
    an.name as main_body_name,
    1 as preferred_fg
from
    group_master gm,
    agent_name an,
    agent_name an_source,
    agent_name_role anr
where gm.group_id = an.agent_id
and gm.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type= 'pref'
and anr.role_type = 'datasrc'
and an.name_type = 'pref'
and an.agent_name_id not in (
    select an_pref.agent_name_id
    from group_master gm, agent_name an_pref, agent_name an_kew
    where gm.group_id = an_pref.agent_id
    and an_pref.agent_id = an_kew.agent_id
    and an_pref.name_type = 'pref'
    and an_kew.name_type = 'kewabbr'
    and an_pref.name = an_kew.name)
union
-- non-pref group names that are not duplicates of pref name
select
    gm.group_id as short_identifier,
    an.name as display_name,
    an.name_type as term_type,
    'accepted' as term_status,
    an.name as term_name,
    'English' as term_language,
    an.name as term_source,
    an.agent_id as term_source_id,
    an.name as main_body_name,
    0 as preferred_fg
from
    group_master gm,
    agent_name an,
    agent_name an2,
    agent_name an_source,
    agent_name_role anr
where gm.group_id = an.agent_id
and an.agent_id = an2.agent_id
and an.name_type != 'pref'
and an2.name_type = 'pref'
and an.name != an2.name
and gm.data_source_id = an_source.agent_id
and an_source.agent_name_id = anr.agent_name_id
and an_source.name_type = anr.name_type
and an_source.name_type = 'pref'
and anr.role_type = 'datasrc'
order by short_identifier, preferred_fg desc;

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_groupTerm 0.1

MERGE: Groups Associated Persons (repeating group) - Lam (SQL ready), Talend dev (Yuteh)

Field mapping

SQL

--need to verify person_id = 0 records
select
    gp.group_id as legacy_org_id,
    gp.person_id as legacy_affperson_id,
    an.name as affiliated_person,
    'urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name('
        + convert(varchar(10), gp.person_id) + ')''' +  an.name + ''''
    as affiliated_person_refname,
    'member' as affiliated_person_type
from
    group_person gp,
    agent_name an
where gp.person_id = an.agent_id
and an.name_type = 'pref'

Notes

Talend/ETL

New Talend job by Yuteh: BG_org_groupAssocPsn 0.1

Taxonomy (default taxonomy vocabulary for verified, and possibly unverified, names) - Lam (SQL ready), Yuteh (Talend)

See attached taxonomy map

Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Field mapping

SQL

-- records that do not have scope notes and are in taxon_name table
select
    convert(varchar(10), l.lexicon_id) as short_identifier,
    l.term as display_name,
    case
        when l.node_type in (
            'cl', 'div', 'f', 'fam', 'gen', 'or', 'sect', 'sp',
            'subcl', 'subgen', 'subsect', 'subsp', 'var')
        then l.node_type
        else null
    end as taxon_rank,
    fam.term as family,
    case when fam.term is null or fam.term = ''
        then null
        else
            'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
            + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
    end as family_refname,
    tn.notho_taxon as hybrid_fg,
    case
        when tn.access_code is not null then
            case
                when tn.access_code = 0 then 'Unrestricted'
                when tn.access_code = 1 then 'Restricted'
                when tn.access_code = 2 then 'Dead'
            end
        else
            case
                when l.access_code = 0 then 'Unrestricted'
                when l.access_code = 1 then 'Restricted'
                when l.access_code = 2 then 'Dead'
            end
    end as access_restrictions,
    null as taxon_note                                                          
from
    lexicon l,
    taxon_name tn,
    (select tc.descendant_id, tc.ancestor_id, l.term
        from trans_closure tc, lexicon l
        where tc.ancestor_id = l.lexicon_id
        and l.node_type = 'fam') fam
where l.lexicon_id *= tn.taxon_id
and l.lexicon_id *= fam.descendant_id
and l.node_type not in ('common', 'bioname', 'na', 'root', 'cv')
and l.lexicon_id not in (select lexicon_id from scope_notes)
and l.lexicon_id not in (
    select lexicon_id from lexicon_term_relation
    where relation_type in ('TRV', 'TNA'))
and lexicon_id not in (
40017020, 40017429, 40031851, 40031856, 40031857, 40032337, 40012018, 40016672, 40010160, 40015069,
 40015070, 40029985, 40014456, 40014457, 40014462, 40014463, 40030806, 40014465, 40014466, 202870,
40025519, 100064, 40012457, 40031859, 40013623, 20100090, 40032453, 40032457, 40034294, 40014967,
40014471, 40014472, 20201711, 20201785, 40014468, 40014469, 20100116, 40016773, 40005868, 40033585,
 40033586, 40025813, 315305, 40014791, 40014792, 202729, 40032253, 40032254, 20100160, 40030847,
40032250, 40032251, 40024254, 40030854, 40030855, 40033591, 40033592, 20202247, 40014474, 40014475,
202261, 40021891, 40014477, 40014478, 40012355, 40017959, 40005017, 40014459, 40014460, 40030850,
 40014480, 40014481, 200287, 40022148, 40022149, 40014995, 40011239, 40011242, 40031768, 40034171,
40011246, 40019386, 40028973, 40029003, 402302, 40020726, 40030845, 40030846, 40033588, 40033589,
314246, 20201549, 40011993, 20401262, 20401279, 20401277, 20401278, 20401291, 20401273, 40021845,
 40020725, 40014483, 40014484, 40021609, 40005593, 309267, 201171, 40010997, 40020903)
union
-- records that have scope notes
select
    convert(varchar(10), l.lexicon_id) as short_identifier,
    l.term as display_name,
    case
        when l.node_type in (
            'cl', 'div', 'f', 'fam', 'gen', 'or', 'sect', 'sp',
            'subcl', 'subgen', 'subsect', 'subsp', 'var')
        then l.node_type
        else null
    end as taxon_rank,
    fam.term as family,
    case when fam.term is null or fam.term = ''
        then null
        else
            'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
            + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
    end as family_refname,
    tn.notho_taxon as hybrid_fg,
    case
        when tn.access_code is not null then
            case
                when tn.access_code = 0 then 'Unrestricted'
                when tn.access_code = 1 then 'Restricted'
                when tn.access_code = 2 then 'Dead'
            end
        else
            case
                when l.access_code = 0 then 'Unrestricted'
                when l.access_code = 1 then 'Restricted'
                when l.access_code = 2 then 'Dead'
            end
    end as access_restrictions,
    convert(varchar(255), sn.scope_notes) as taxon_note
from
    lexicon l,
    taxon_name tn,
    (select tc.descendant_id, tc.ancestor_id, l.term
        from trans_closure tc, lexicon l
        where tc.ancestor_id = l.lexicon_id
        and l.node_type = 'fam') fam,
    scope_notes sn
where l.lexicon_id *= tn.taxon_id
and l.lexicon_id *= fam.descendant_id
and l.node_type not in ('common', 'bioname', 'na', 'root', 'cv')
and l.lexicon_id = sn.lexicon_id
and l.lexicon_id not in (
    select lexicon_id from lexicon_term_relation
    where relation_type in ('TRV', 'TNA'))
union
-- cultivar records without scope notes
select
    convert(varchar(10), c.lexicon_id) as short_identifier,
    c.new_term as display_name,
    'cv' as taxon_rank,
    fam.term as family,
    case when fam.term is null or fam.term = ''
        then null
        else
            'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
            + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
    end as family_refname,
    tn.notho_taxon as hybrid_fg,
    case
        when tn.access_code is not null then
            case
                when tn.access_code = 0 then 'Unrestricted'
                when tn.access_code = 1 then 'Restricted'
                when tn.access_code = 2 then 'Dead'
            end
        else
            case
                when c.access_code = 0 then 'Unrestricted'
                when c.access_code = 1 then 'Restricted'
                when c.access_code = 2 then 'Dead'
            end
    end as access_restrictions,
    null as taxon_note
from
    cspace_lexicon_cv c,
    taxon_name tn,
    (select tc.descendant_id, tc.ancestor_id, l.term
        from trans_closure tc, lexicon l
        where tc.ancestor_id = l.lexicon_id
        and l.node_type = 'fam') fam
where c.lexicon_id *= tn.taxon_id
and c.lexicon_id *= fam.descendant_id
and c.lexicon_id not in (select lexicon_id from scope_notes)
and c.lexicon_id not in (
    select lexicon_id from lexicon_term_relation
    where relation_type in ('TRV', 'TNA'))
union
-- cultivar records that have scope notes
select
    convert(varchar(10), c.lexicon_id) as short_identifier,
    c.new_term as display_name,
    'cv' as taxon_rank,
    fam.term as family,
    case when fam.term is null or fam.term = ''
        then null
        else
            'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
            + convert(varchar(10), fam.ancestor_id) + ')''' + fam.term + ''''
    end as family_refname,
    tn.notho_taxon as hybrid_fg,
    case
        when tn.access_code is not null then
            case
                when tn.access_code = 0 then 'Unrestricted'
                when tn.access_code = 1 then 'Restricted'
                when tn.access_code = 2 then 'Dead'
            end
        else
            case
                when c.access_code = 0 then 'Unrestricted'
                when c.access_code = 1 then 'Restricted'
                when c.access_code = 2 then 'Dead'
            end
    end as access_restrictions,
    convert(varchar(255), sn.scope_notes) as taxon_note
from
    cspace_lexicon_cv c,
    taxon_name tn,
    (select tc.descendant_id, tc.ancestor_id, l.term
        from trans_closure tc, lexicon l
        where tc.ancestor_id = l.lexicon_id
        and l.node_type = 'fam') fam,
    scope_notes sn
where c.lexicon_id *= tn.taxon_id
and c.lexicon_id *= fam.descendant_id
and c.lexicon_id = sn.lexicon_id
and c.lexicon_id not in (
    select lexicon_id from lexicon_term_relation
    where relation_type in ('TRV', 'TNA'))
order by short_identifier;

Notes

Not getting author shortIDs in view, just names. For now putting them in Taxon Notes. Adjust view or get them separately in SQL.
Do we need to adjust the format of scientific names?
Lots of other Sage data for Names to add
SQL query above is a union of names without scope notes and names with scope notes. the query that added an outer join with the scope notes table took over 50 minutes to run, so it was easier to split it out into a union of two queries.
SQL query above excludes the following node_types: bioname, common, na, root. It also excludes TRE, TRV, and TNA terms. TNA terms will be added as an alternate name of type 'no author name'. TRE names are all bionames, so the bioname exclusion will also exclude TRE names.
This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Talend/ETL

bgtaxonimport2 0.2

MERGE: Taxon Terms (repeating group) - Lam (SQL ready), Yuteh (Talend)

Field mapping

SQL

Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

-- verified names
select
    convert(varchar(10), l.lexicon_id) as short_identifier,
    l.term as display_name,
    l.term as term_name,
    'Verified Taxonomic Name' as term_type,
    an.name as term_source,                                                     
    an.agent_id as term_source_uid,
    'accepted' as term_status,
    case
        when tn.catalog_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    1 as preferred_flag
from
    lexicon l,
    taxon_name tn,
    agent_name an,
    agent_name_role anr
where l.lexicon_id = tn.taxon_id
and tn.data_src_id = an.agent_id
and an.agent_id = anr.agent_id
and an.agent_name_id = anr.agent_name_id
and an.name_type = anr.name_type
and an.name_type = 'pref'
and anr.role_type = 'datasrc'
and l.node_type not in ('common', 'bioname', 'na', 'root', 'cv')
union
-- no author names
select
    convert(varchar(10), rl.lexicon_id) as short_identifier,
    l.term as display_name,
    l.term as term_name,
    'Taxon No Author Name' as term_type,
    null as term_source,
    null as term_source_uid,
    'accepted' as term_status,
    case
        when l.valid_catalog_term_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    0 as preferred_flag
from
    lexicon l,
    lexicon_term_relation ltr,
    lexicon rl
where l.lexicon_id = ltr.lexicon_id
and ltr.relation_type = 'TNA'
and ltr.related_lexicon_id = rl.lexicon_id
and l.node_type != 'cv'
and l.term != rl.term
union
-- remaining 821 terms that are not verified terms and are not no author names:
select
    convert(varchar(10), l.lexicon_id) as short_identifier,
    l.term as display_name,
    l.term as term_name,
    'Unverified Name' as term_type,
    null as term_source,
    null as term_source_uid,
    'accepted' as term_status,
    case
        when l.valid_catalog_term_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    0 as preferred_flag
from
    lexicon l
where l.lexicon_id not in (
    select taxon_id from taxon_name)
and l.node_type not in ('bioname', 'common', 'root', 'na', 'cv')
and l.lexicon_id not in (
    select lexicon_id from lexicon_term_relation where relation_type = 'TNA')
and l.lexicon_id not in (
    select lexicon_id from lexicon_term_relation where relation_type = 'TRV')
union
-- cv verified names
select
    convert(varchar(10), c.lexicon_id) as short_identifier,
    c.new_term as display_name,
    c.new_term as term_name,
    'Verified Taxonomic Name' as term_type,
    an.name as term_source,
    an.agent_id as term_source_uid,
    'accepted' as term_status,
    case
        when tn.catalog_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    1 as preferred_flag
from
    cspace_lexicon_cv c,
    taxon_name tn,
    agent_name an,
    agent_name_role anr
where c.lexicon_id = tn.taxon_id
and tn.data_src_id = an.agent_id
and an.agent_id = anr.agent_id
and an.agent_name_id = anr.agent_name_id
and an.name_type = anr.name_type
and an.name_type = 'pref'
and anr.role_type = 'datasrc'
union
-- cv no author names
select
    convert(varchar(10), rl.lexicon_id) as short_identifier,
    c.new_term as display_name,
    c.new_term as term_name,
    'Taxon No Author Name' as term_type,
    null as term_source,
    null as term_source_uid,
    'accepted' as term_status,
    case
        when c.valid_catalog_term_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    0 as preferred_flag
from
    cspace_lexicon_cv c,
    lexicon_term_relation ltr,
    cspace_lexicon_cv rl
where c.lexicon_id = ltr.lexicon_id
and ltr.relation_type = 'TNA'
and ltr.related_lexicon_id = rl.lexicon_id
and c.new_term != rl.new_term
union
-- remaining 821 terms that are not verified terms and are not no author names:
select
    convert(varchar(10), c.lexicon_id) as short_identifier,
    c.new_term as display_name,
    c.new_term as term_name,
    'Unverified Name' as term_type,
    null as term_source,
    null as term_source_uid,
    'accepted' as term_status,
    case
        when c.valid_catalog_term_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    0 as preferred_flag
from
    cspace_lexicon_cv c
where c.lexicon_id not in (
    select taxon_id from taxon_name)
and c.lexicon_id not in (
    select lexicon_id from lexicon_term_relation where relation_type = 'TNA')
and c.lexicon_id not in (
    select lexicon_id from lexicon_term_relation where relation_type = 'TRV')
union
-- grex names
select
    convert(varchar(10), l.lexicon_id) as short_identifier,
    rtrim(str_replace(substring(l.term, 1, patindex('% ''%''', l.term)), ' cv. ', ' ')) as display_name,
    rtrim(str_replace(substring(l.term, 1, patindex('% ''%''', l.term)), ' cv. ', ' ')) as term_name,
    'Grex Name' as term_type,
    an.name as term_source,
    an.agent_id as term_source_uid,
    'accepted' as term_status,
    case
        when tn.catalog_flag = 1 then 'valid'
        else 'invalid'
    end as taxonomic_status,
    0 as preferred_flag
from
    lexicon l,
    taxon_name tn,
    agent_name an,
    agent_name_role anr
where l.lexicon_id = tn.taxon_id
and tn.data_src_id = an.agent_id
and an.agent_id = anr.agent_id
and an.agent_name_id = anr.agent_name_id
and an.name_type = anr.name_type
and an.name_type = 'pref'
and anr.role_type = 'datasrc'
and l.node_type = 'cv'
and l.term like '%cv. _%''%'''
and l.term not like 'cv. %'
and l.lexicon_id != 40025687
order by preferred_flag desc, short_identifier;

Notes

Note: This query requires a table be refreshed on SAGE, cspace_lexicon_cv.

Talend/ETL

MERGE: Taxon Authors (repeating group) - Lam (SQL ready)

Field mapping

SQL

select
    tn.taxon_id as legacy_id,
    an.agent_id as agent_id,
    tn.author_id as author_name_id,
    an.name as author_name,
    'author' as author_type,
    1 as author_order
from
    taxon_name tn,
    agent_name an
where tn.author_id = an.agent_name_id
and an.name_type = 'kewabbr' and tn.author_id <> 5
union
select
    tn.taxon_id as legacy_id,
    an.agent_id as agent_id,
    tn.ascr_auth_id as author_name_id,
    an.name as author_name,
    'ascribed author' as author_type,
    2 as author_order
from
    taxon_name tn,
    agent_name an
where tn.ascr_auth_id = an.agent_name_id
and an.name_type = 'kewabbr' and tn.ascr_auth_id <> 5
union
select
    tn.taxon_id as legacy_id,
    an.agent_id as agent_id,
    tn.p_author_id as author_name_id,
    an.name as author_name,
    'parenthetical author' as author_type,
    3 as author_order
from
    taxon_name tn,
    agent_name an
where tn.p_author_id = an.agent_name_id
and an.name_type = 'kewabbr' and tn.p_author_id <> 5
union
select
    tn.taxon_id as legacy_id,
    an.agent_id as agent_id,
    tn.p_ascr_auth_id as author_name_id,
    an.name as author_name,
    'parenthetical ascribed author' as author_type,
    4 as author_order
from
    taxon_name tn,
    agent_name an
where tn.p_ascr_auth_id = an.agent_name_id
and an.name_type = 'kewabbr' and tn.p_ascr_auth_id <> 5
order by legacy_id, author_order

Notes

Talend/ETL

MERGE: Taxon Common Names (repeating group) - Lam (SQL ready)

Field mapping

Need to get relationship type to and map to values in CSpace: Preferred, Secondary, Rejected

SQL

select
    ltr.related_lexicon_id as legacy_id,
    ltr.lexicon_id as common_name_id,
    l.term as common_name,
    'English' as language,
    case when ltr.relation_type = 'TCN' then 'preferred'
         when ltr.relation_type = 'TCNS' then 'secondary'
         when ltr.relation_type = 'TCNR' then 'rejected'
         end as common_name_type
from
    lexicon_term_relation ltr,
    lexicon l
where ltr.lexicon_id = l.lexicon_id
and ltr.relation_type in ('TCN', 'TCNS', 'TCNR')

Notes

Talend/ETL

MERGE: Taxon Attributes (repeating group) - Lam (SQL ready)

E.g., dimensions, habitat, etc.

Field mapping

SQL

select
    pa.lexicon_id as legacy_id,
    pa.date_entered as attribute_date,
    pa.agent_id as recorded_by_id,
    rb.name as recorded_by_name,
    pa.plant_height as height,
    pa.plant_width as width,
    pa.plant_dbh as dbh,
    pa.plant_habit as habitat,
    pa.climate_rating as climate_rating,
    ca.agent_id as conservation_agent_id,
    ca.name as conservation_agent_name,
    cc.conserv_cat as convservation_code,
    case
     when ca.name = 'California Native Plant Society' then 'CNPS'
     when ca.name = 'not applicable' then ''
     when ca.name = 'International Union for Conservation of Nature and Natural Resources - 1997' then 'IUCN-1997'
--   when ca.name = 'Convention on International Trade in Endangered Species' then 'CNPS'
     when ca.name = 'Convention on International Trade in Endangered Species' then 'CITES'
     when ca.name = 'Rhododendron Species Foundation' then 'RSF'
--   when ca.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'CNPS'
     when ca.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'IUCN-2006'
     when ca.name = 'South African National Biodiversity Institute' then 'SANBI'
     when ca.name = 'UNEP World Conservation Monitoring Center - CITES' then 'UNEP-CITES'
     when ca.name = 'New York Botanical Garden' then 'NYBG'
     when ca.name = 'US Fish & Wildlife Service' then 'USFWS'
     when ca.name = 'California Department of Fish & Game' then 'CADFG'
     when substring(cc.comp_key, 1, 5) = '60801' or ca.name = 'IUCN/SSC Global Tree Specialist Group, Red List of Rhododendrons - 2011' then 'IUCN-Rhododendrons-2011'
     when substring(cc.comp_key, 1, 5) = '60804' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2011.1' then 'IUCN-2011.1'
     when substring(cc.comp_key, 1, 5) = '60849' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2011.2' then 'IUCN-2011.2'
     when substring(cc.comp_key, 1, 5) = '60905' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2012.1' then 'IUCN-2012.1'
     when substring(cc.comp_key, 1, 5) = '60917' or ca.name = 'International Union for Conservation of Nature and Natural Resources - 2012.2' then 'IUCN-2012.2'
     when ca.name = 'Non-conservation publication' then 'Non-conservation pub'
     when cc.conserv_id in (38, 39, 41, 42, 47) then  ''
     end as conservCode_append,
    pa.frost_sensitive as frost_sensitive,
    pa.medicinal_flag as medicinal_use,
    pa.economic_flag as economic_use
from
    plant_attr pa,
    agent_name rb,
    agent_name ca,
    conserv_code cc,
    agent_name_role anr
where pa.agent_id = rb.agent_id
and rb.name_type = 'pref'
and pa.conserv_id = cc.conserv_id
and cc.comp_key *= anr.comp_key
and cc.role_type *= anr.role_type
and cc.role_type = 'conserv'
and anr.agent_id *= ca.agent_id
and anr.agent_name_id *= ca.agent_name_id
and anr.name_type *= ca.name_type
and ca.name_type = 'pref'

Notes

Talend/ETL

MERGE: Taxon Related Terms (repeating group) - Lam (SQL ready)

Might be needed for relationships to synonyms, unverified terms, verified terms (?), to hybrid parent names (?), etc.

Field mapping

SQL

select
    ltr.related_lexicon_id as legacy_id,
    ltr.lexicon_id as related_term_id,
    l.term as related_term,
    ltr.relation_type as relation_type
    'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
        + convert(varchar(10), ltr.lexicon_id) + ')''' + l.term + ''''
        as related_term_refname
from
    lexicon_term_relation ltr,
    lexicon l
where ltr.lexicon_id = l.lexicon_id
and ltr.relation_type in ('FHP', 'MHP', 'SYN')
and l.node_type != 'cv'
union
-- cv relations
select
    ltr.related_lexicon_id as legacy_id,
    ltr.lexicon_id as related_term_id,
    c.term as related_term,
    ltr.relation_type as relation_type
    'urn:cspace:botgarden.cspace.berkeley.edu:taxonomyauthority:name(taxon):item:name('
        + convert(varchar(10), ltr.lexicon_id) + ')''' + c.term + ''''
        as related_term_refname
from
    lexicon_term_relation ltr,
    cspace_lexicon_cv c
where ltr.lexicon_id = c.lexicon_id
and ltr.relation_type in ('FHP', 'MHP', 'SYN')
order by legacy_id, relation_type;

Notes

Talend/ETL

Taxonomic Hierarchy (hierarchical relations) - Lam (SQL ready)

Based on parent taxa identified in taxon table

Field mapping

SQL

select
    taxon_id as short_identifier,
    parent_id as parent_id
from
    taxon_name tn,
    lexicon l
where tn.taxon_id = l.lexicon_id
and l.node_type not in ('common', 'bioname', 'na', 'root') and parent_id <> 0
order by short_identifier;

Notes

Talend/ETL

Common Taxonomic Names (common vocabulary in taxonomy authority) - Lam (SQL ready), Yuteh (Talend)

Should get Access Code in this SQL too, right?

Field mapping

SQL

-- this query includes 16 common name duplicates.
-- duplicates are referenced in lexicon_term_relation.lexicon_id where relation_type in ('TCN', 'TCNS', 'TCNR')
select
    l.lexicon_id as short_identifier,
    l.term  as display_name,
    case
                when l.access_code = 0 then 'Unrestricted'
                when l.access_code = 1 then 'Restricted'
                when l.access_code = 2 then 'Dead'
    end  as access_restrictions
from
    lexicon l
where l.node_type = 'common';

Notes

exact duplicates (across term, access_code, valid_catalog_term_flag):

select count(*), term
from lexicon where node_type = 'common'
group by term, access_code, valid_catalog_term_flag
having count(*) > 1
order by term

2 Arnica
2 California-Lilac
2 crimson glory vine
2 Hakeke, Mountain Holly
2 Horoeka, Lancewood
2 Horopito, Pepper Tree
2 Lion's Spoor
2 Monkeyflower
2 Mountain Rimu, Pigmy pine
2 Mountain totara, Snow totara
2 Papapa, Snowberry
2 Tanekaha, Celery pine
2 Tarata, Lemonwood
2 Wi Kura, Narrow-leaved Snow Tussock

These two names are term duplicates (access_code or valid_catalog_term_flag are different)

2 SECTION CHONIASTRUM
2 ti kauka, Cabbage tree

Here are the duplicates that are referenced in lexcon_term_relation for common names:

select distinct l.lexicon_id, l.term
from lexicon l, lexicon_term_relation ltr
where l.lexicon_id = ltr.lexicon_id
and relation_type in ('TCN', 'TCNS', 'TCNR')
and l.term in (
        select term
        from lexicon
        where node_type = 'common'
        group by term
        having count(*) > 1)
order by term, lexicon_id;

40000115 Arnica
40002353 Arnica
40000552 California-Lilac
40000588 California-Lilac
40006165 Crimson Glory Vine
40001587 Hakeke, Mountain Holly
40001588 Hakeke, Mountain Holly
40001687 Horoeka, Lancewood
40001688 Horoeka, Lancewood
40001690 Horopito, Pepper Tree
40001691 Horopito, Pepper tree
40024379 Lion's Spoor
40024418 Lion's Spoor
40002311 Monkeyflower
40002314 Monkeyflower
40002384 Mountain Rimu, Pigmy pine
40002385 Mountain Rimu, Pigmy Pine
40002391 Mountain totara, Snow totara
40002392 Mountain totara, Snow totara
40002615 Papapa, Snowberry
40002616 Papapa, Snowberry
40003741 Tanekaha, Celery pine
40003742 Tanekaha, Celery pine
40003751 Tarata, Lemonwood
40003752 Tarata, Lemonwood
40003809 Ti kauka, Cabbage tree
40004151 Wi Kura, Narrow-leaved Snow Tussock
40004152 Wi Kura, Narrow-leaved snow tussock

Note: 40004151 and 40004152 are duplicates because of case insensitivity in Sybase db.

Talend/ETL

Plant Sales taxonomic vocabulary (plantsales vocabulary in taxonomy authority)

We will be constructing a taxonomy vocabulary (third instance) called Plant Sales that will be used to support the Pot Tag procedure. This will be made up of the unique taxonomic names in SAGE's volunteer_label table (using the formatted name field).

Question: Will we also use family names from the volunteer_label table?

Field Mapping

...

SQL

/*
-- View to get smallest label_id as the short identifier for the formatted_taxon name.
-- Manually excludes 7 sets of duplicates due to the taxon name being a valid genus name as well as a species name.
-- The name in volunteer_label.formatted_taxon is the genus name, but due to the Sybase db being case insensitive, it creates a false match.

create view cspace_formatted_taxon_id_rank as
select
    min(vl.label_id) as label_id,
    vl.formatted_taxon,
    l.node_type,
    l.term
from volunteer_label vl, lexicon l
where vl.formatted_taxon *= l.term
and l.node_type not in ('bioname', 'common', 'na', 'root')
and l.lexicon_id not in (
    20300328, 20300815, 20302343, 20302378, 40022302, 20304254, 20304427)
group by vl.formatted_taxon, l.node_type, l.term;

*/

-- Query to get the plant sales taxon main record.
-- There is only one term name, so no need for a second query for term names for xml merge.

select distinct
    id_rank.label_id as short_identifier,
    vl.formatted_taxon as display_name,
    case
        when id_rank.node_type is not null then id_rank.node_type
        else
            case
                when vl.formatted_taxon like '% cv. %' then 'cv'
                when vl.formatted_taxon like '% var. %' then 'var'
                when (vl.formatted_taxon like '% f. %'
                    and vl.formatted_taxon not like '% F. Ritter%'
                    and vl.formatted_taxon not like '% F. Muell.%'
                    and vl.formatted_taxon not like '% F. Dietr.%'
                    and vl.formatted_taxon not like '% F. Schmidt%'
                    and vl.formatted_taxon not like '% F. Phil.%'
                    and vl.formatted_taxon not like '% F. Schwarz%') then 'f'
                when vl.formatted_taxon like '% subsp. %' then 'subsp'
                else 'sp'
            end
    end as taxon_rank,
    vl.formatted_taxon as term_name,
    null as term_type,
    'accepted' as term_status,
    'valid' as taxonomic_status
from
    volunteer_label vl,
    cspace_formatted_taxon_id_rank id_rank
where vl.formatted_taxon *= id_rank.formatted_taxon
and vl.formatted_taxon is not null;

Notes

...

Talend/ETL

...

Garden Locations (Storage locations vocabulary) - SQL ready, Talend (Chris)

See attached garden locations map

Field mapping

SQL

SELECT
    'garden'||convert(varchar(5), garden_loc_id) shortidentifier,
    loc_type locationType,
    ltrim(full_loc_name) locationDisplayName,
    garden_loc_id legacy_id,
    parent_id,
    loc_hier termSourceNote,
    ltrim(loc_name) locationName,
    entered_staff_id,
    date_entered,
    last_change_staff_id,
    last_change_date
FROM
    dbo.garden_location

Notes

Location Types: bed, section, collection
Location Type format: <locationType>urn:cspace:botgarden.cspace.berkeley.edu:vocabularies:name(locationtype):item:name(drawer)'Drawer'</locationType>
Vocabulary CSID to persist: c2833911-4bca-49aa-a3f3
Refname format: <refName>urn:cspace:botgarden.cspace.berkeley.edu:locationauthorities:name(location):item:name(crhbed11350962164634)'crhbed1'</refName>

Talend/ETL

bgimportgardenlocation 0.2 in Chris's Talend botgarden repository

Rename output file from ucbggardenlocation job to gardenlocout.txt!!

Garden Locations Hierarchy (hierarchical relationships) - SQL ready, Talend (Chris)

Field mapping

SQL

Notes

Talend/ETL

Rename output file from ucbggardenlocation job to gardenlocout.txt!!

ucbglocrelations v0.1in Chris's Talend botgarden repository

Job creates 571 relations, but one needs to be removed. garden0/'none' has no broader record but has output in the file. Remove by hand. Looked like:

<import service="Relations" type="Relation" CSID="1fde0aff-dbb0-4446-9745-eaa6df4039ae">
  <schema xmlns:relations_common="http://collectionspace.org/relation" name="relations_common">
    <relationshipType>hasBroader</relationshipType>
    <subjectDocumentType>Locationitem</subjectDocumentType>
    <subjectCsid>96fbf471-1f5f-4dab-8a25-516f0eab3e5b</subjectCsid>
    <subjectUri>/locationauthorities/dbf254e8-437d-4926-be5d/items/96fbf471-1f5f-4dab-8a25-516f0eab3e5b</subjectUri>
    <subjectRefName>urn:cspace:botgarden.cspace.berkeley.edu:locationauthorities:name(location):item:name(garden0)'none'</subjectRefName>
    <objectDocumentType>Locationitem</objectDocumentType>
    <objectCsid/>
    <objectUri/>
    <objectRefName/>
  </schema>
</import>

Place Authority - Lam (SQL), Yuteh (Talend)

Field mapping

SQL

select
    gl.geog_lex_id as place_id,
    gl.geog_term as display_name,
    gtc.geog_type_name as place_type,
    gn.notes as place_note
from
    geog_lexicon gl,
    geog_type_code gtc,
    geog_name gn
where gl.geog_lex_id not in (
    select geog_lex_id
    from geog_term_relation
    where relation_type in ('SYN', 'RQN'))
and gl.geog_type not in ('element', 'root')
and gl.geog_type = gtc.geog_type
and gl.geog_lex_id *= gn.geog_id;

Notes

Talend/ETL

MERGE: Place terms (repeating group)

Field mapping

SQL

-- geography term name repeating group
-- verified geographic names
select
    gl.geog_lex_id as place_id,
    gl.geog_term as display_name,
    gl.geog_term as term_name,
    'English' as term_language,
    gn.iso_country_cd as abbreviation,
    'Verified Geographic Name' as term_type,
    gn.data_src_id as term_source_id,
    an.name as term_source,
    gn.valid_term_flag as term_status,
    1 as preferred_flag
from
    geog_lexicon gl,
    geog_name gn,
    agent_name an
where gl.geog_lex_id = gn.geog_id
and gn.data_src_id = an.agent_id
and an.name_type = 'pref'
union
-- short names
select
    gn.geog_id as place_id,
    gl.geog_term as display_name,
    gl.geog_term as term_name,
    'English' as term_language,
    gn.iso_country_cd as abbreviation,
    'Short Name' as term_type,
    gn.data_src_id as term_source_id,
    an.name as term_source,
    0 as term_status,
    0 as preferred_flag
from
    geog_lexicon gl,
    geog_name gn,
    agent_name an
where gl.geog_lex_id = gn.geog_element_id
and gn.data_src_id = an.agent_id
and an.name_type = 'pref'
union
-- reverse qualified names and synonyms                                         
select
    gtr.related_geog_lex_id as place_id,
    gl.geog_term as display_name,
    gl.geog_term as term_name,
    'English' as term_language,
    null as abbreviation,
    case
        when gtr.relation_type = 'RQN' then 'Reverse Qualified Name'
        when gtr.relation_type = 'SYN' then 'Synonym'
        else null
    end as term_type,
    null as term_source_id,
    null as term_source,
    gl.valid_term_flag as term_status,
    0 as preferred_flag
from
    geog_term_relation gtr,
    geog_lexicon gl,
    geog_lexicon rgl
where gtr.geog_lex_id = gl.geog_lex_id
and gtr.related_geog_lex_id = rgl.geog_lex_id
and gtr.relation_type in ('RQN', 'SYN')
and gl.geog_term != rgl.geog_term
union
-- remaining 71 geog terms that are not verified names, reverse qualified names, or synonyms
select
    gl.geog_lex_id as place_id,
    gl.geog_term as display_name,
    gl.geog_term as term_name,
    'English' as term_language,
    null as abbreviation,
    'Unverified Geographic Name' as term_type,
    null as term_source_id,
    null as term_source,
    gl.valid_term_flag as term_status,
    1 as preferred_flag
from
    geog_lexicon gl
where gl.geog_lex_id not in (
    select geog_lex_id from geog_term_relation where relation_type in ('RQN', 'SYN'))
and gl.geog_lex_id not in (
    select geog_id from geog_name)
and gl.geog_type not in ('element', 'root')
order by place_id, preferred_flag, display_name;

Notes

Talend/ETL

Place hierarchy (hierarchical relationships)

Field mapping

SQL

select
    gn.geog_id as place_id,
    case
        when gn.geog_parent_id = 0 then null
        else gn.geog_parent_id
    end as parent_id
from
    geog_name gn,
    geog_lexicon gl
where gl.geog_lex_id = gn.geog_id
and gl.geog_type not in ('element', 'root');

Notes

Talend/ETL

Research Projects (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

SELECT
    project_id legacy_id,
    'researchproject'||convert(varchar(20), project_id) shortIdentifier,
    project_name conceptName,
    person_id conceptPerson,
    org_id conceptOrganization,
    agent_id createdBy,
    date_entered createdAt,
    last_change_date updatedAt,
    last_change_agent_id updatedBy
FROM
    dbo.research_project

Notes

refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(research_ca):item:name(research11350934474341)'research1'
vocabulary CSID to persist: deb53f97-9829-47b9-99cb

Talend/ETL

New Talend job by Yuteh: BG_concept_researchProj 0.1

Classes (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

SELECT
    course_id legacy_id,
    'class'||convert(varchar(20), course_id) shortIdentifier,
    course_title conceptName,
    ad.name||' '||course_num conceptTerm_coursenum,
    person_id conceptPerson,
    cu.org_id conceptOrganization_dept,
    ad.name deptName,
    inst_id institution_orgParent,
    ai.name conceptNote_instName,
    semester conceptQualifier_semester,
    cu.agent_id createdBy,
    cu.date_entered createdAt,
    cu.last_change_date updatedAt,
    cu.last_change_agent_id updatedBy
FROM
    dbo.class_use cu
left outer join agent_name ad on (cu.org_id = ad.agent_id)
left outer join agent_name ai on (cu.inst_id = ai.agent_id)

Notes

Institution (inst_id) is treated as the parent institution of the department (org_id). However, this does not necessarily match what is recorded in the agent tables. We will put the institution name in the scope notes field.
refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(class_ca):item:name(crhtestclass11350934263503)'crhtestclass1'
vocabulary CSID to persist: 96141839-4bac-4e3d-a511

Talend/ETL

New Talend job by Yuteh: BG_concept_class 0.1

Propagation medium (Concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

SELECT
    case when medium_description is null or medium_description = '' then medium_name
      else medium_description end as conceptDisplayName,
    medium_name conceptTerm,
    entered_staff_id,
    date_entered,
    last_change_staff_id,
    last_change_date
FROM
    dbo.medium
order by medium_description

Notes

Construct the shortidentifier as 'medium' + an incrementing number, in Talend
Refname format: urn:cspace:botgarden.cspace.berkeley.edu:conceptauthorities:name(concept):item:name(medium11350937075031)'medium1'
Vocabulary CSID to persist: bbe39bb2-a622-4570-a946

Talend/ETL

New Talend job by Yuteh: BG_concept_propgMedium 0.1

Conservation Category (concept vocabulary) - Chris (SQL ready), Talend (Yuteh)

Field mapping

SQL

SELECT
    conserv_id,
    'conserv'||convert(varchar(5), conserv_id) shortIdentifier,
    comp_key,
    substring(comp_key, 1, 5) consagent_name_id,
    an.name agentname,
    rtrim(conserv_cat) conserv_code,
    case when an.name = 'California Native Plant Society' then 'CNPS'
     when an.name = 'not applicable' then ''
     when an.name = 'International Union for Conservation of Nature and Natural Resources - 1997' then 'IUCN-1997'
--   when an.name = 'Convention on International Trade in Endangered Species' then 'CNPS'
     when an.name = 'Convention on International Trade in Endangered Species' then 'CITES'
     when an.name = 'Rhododendron Species Foundation' then 'RSF'
--   when an.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'CNPS'
     when an.name = 'International Union for Conservation of Nature and Natural Resources - 2006' then 'IUCN-2006'
     when an.name = 'South African National Biodiversity Institute' then 'SANBI'
     when an.name = 'UNEP World Conservation Monitoring Center - CITES' then 'UNEP-CITES'
     when an.name = 'New York Botanical Garden' then 'NYBG'
     when an.name = 'US Fish & Wildlife Service' then 'USFWS'
     when an.name = 'California Department of Fish & Game' then 'CADFG'
     when substring(comp_key, 1, 5) = '60801' or an.name = 'IUCN/SSC Global Tree Specialist Group, Red List of Rhododendrons - 2011' then 'IUCN-Rhododendrons-2011'
     when substring(comp_key, 1, 5) = '60804' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2011.1' then 'IUCN-2011.1'
     when substring(comp_key, 1, 5) = '60849' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2011.2' then 'IUCN-2011.2'
     when substring(comp_key, 1, 5) = '60905' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2012.1' then 'IUCN-2012.1'
     when substring(comp_key, 1, 5) = '60917' or an.name = 'International Union for Conservation of Nature and Natural Resources - 2012.2' then 'IUCN-2012.2'
     when an.name = 'Non-conservation publication' then 'Non-conservation pub'
     when conserv_id in (38, 39, 41, 42, 47) then ''
     end as conservCode_append,
    cat_descr conserv_descr,
    case when red_dot = 1 then 'red dot on label'
      end as scope_note_red_dot
FROM
    dbo.conserv_code c
left outer join agent_name an on (substring(comp_key, 1, 5) = convert(varchar(20), an.agent_id) and an.name_type = 'pref')
-- where conserv_id not in (38, 39, 41, 42, 47)
order by comp_key

Notes

Conservation Categories are codes that are tied to Conservation Organizations. Each organization can have its own list of conservation codes, and these can and do change over time. Most have a code of 'rare'.

Need to figure out what display name will be. Should the display name in the vocabulary be like: "EN IUCN-1997", "IUCN-1997: Endangered - meets criteria for endangered and is facing a high risk of extinction in the wild". Think about how term completion will work. Might need to derive some organization abbreviations for these display names. Consider these "endangered" categories from different organizations (org, code, description):

EN        	International Union for Conservation of Nature and Natural Resources - 1997	Taxa in danger of extinction and whose survival is unlikely if the causal factors continue operating
EN        	International Union for Conservation of Nature and Natural Resources - 2006	(null)
EN        	UNEP World Conservation Monitoring Center - CITES	(null)
EN        	(null)	(null)
EN        	New York Botanical Garden	(null)
EN        	(null)	(null)
EN        	IUCN/SSC Global Tree Specialist Group, Red List of Rhododendrons - 2011	Endangered - meets criteria for endangered and is facing a high risk of extinction in the wild
EN        	International Union for Conservation of Nature and Natural Resources - 2011.1	endangered
EN        	Non-conservation publication	endangered
EN        	International Union for Conservation of Nature and Natural Resources - 2011.2	vulnerable
EN        	International Union for Conservation of Nature and Natural Resources - 2012.1	endangered
Endangered	US Fish & Wildlife Service	Endangered Species Act listing
Endangered	California Department of Fish & Game	(null)

Talend/ETL

New Talend job by Yuteh: BG_concept_conservCatg 0.1

Collection Objects

Accessions - SQL ready (Chris), Talend dev (Chris), Data botgarden-dev loaded (Chris)

See attached accessions map

Field mapping

SQL

SELECT
    rtrim(a.accession_number) objectnumber,
    convert(date, accession_date) accessionDate,
    case when dead_flag=1 then 'true' else 'false' end as dead_flg,
    convert(date, dead_date) deadDate,
    material_type briefDescMaterial,
    case
      when data_quality = 1 then 'Complete'
      when data_quality = 0 then 'Unknown'
    end as recordStatusQuality,
    material_source,
    case when requested_by <> 0 then requested_by end as requested_by,
    case when collector_id <> 0 then collector_id end as collector_id,
    rtrim(collector_number) collector_number,
    collection_date,
    case when begin_date > 0 then dateadd(dd,(begin_date - 2415021),
        convert(date, '1900-01-01', 103))
    end as begin_date_scalar,
    case when begin_date > 0 then datepart(year, dateadd(dd,(begin_date - 2415021),
        convert(datetime, '1900-01-01', 103)))
    end as begin_date_year,
    case when begin_date > 0 then datepart(month, dateadd(dd,(begin_date - 2415021),
        convert(datetime, '1900-01-01', 103)))
    end as begin_date_month,
    case when begin_date > 0 then datepart(day, dateadd(dd,(begin_date - 2415021),
        convert(datetime, '1900-01-01', 103)))
    end as begin_date_day,
    case when end_date > 0 then dateadd(dd,(end_date - 2415021),
        convert(date, '1900-01-01', 103))
    end as end_date_scalar,
    case when end_date > 0 then datepart(year, dateadd(dd,(end_date - 2415021),
        convert(datetime, '1900-01-01', 103)))
    end as end_date_year,
    case when end_date > 0 then datepart(month, dateadd(dd,(end_date - 2415021),
        convert(datetime, '1900-01-01', 103)))
    end as end_date_month,
    case when end_date > 0 then datepart(day, dateadd(dd,(end_date - 2415021),
        convert(datetime, '1900-01-01', 103)))
    end as end_date_day,
    case when an.notes is not null then accession_notes||'; '||an.notes else accession_notes end
 as accnNotes,
    case when plant_status='1' then 'true' else 'false' end as rare_flag,
    case when cbd_restrict_flag=1 then 'true' else 'false' end as cbd_restrict_flag,
    a.agent_id,
    a.date_entered,
    a.last_change_agent_id,
    a.last_change_date,
    case when l.access_code = 1 then 'no' else 'yes' end as posttopublic,
    cs.habitat_descr fieldCollectionNote,
    cs.veg_type vegetationType,
    flower_color,
    fruit_color,
    security_risk,
    case when (prov_type_cd = 'G' or cs.geog_lex_id in (2031,
  2326)) then 'true' end as cultivar_flag,
    case when prov_type_cd = 'W' then 'W-wild source'
     when prov_type_cd = 'Z' then 'Z-cultivated from wild source'
     when (prov_type_cd = 'G' or cs.geog_lex_id in (2031, 2326)) then 'G-not wild source'
     when prov_type_cd = 'U' then 'U-unknown'
     end as provenance_type,
    case when prop_hist_cd = 'VA' then 'vegetative reproduction'
     when prop_hist_cd = 'S' then 'sexual reproduction'
     end as propagation_history,
    case when class_use_flag = 1 then 'true' else 'false' end as class_use_flg,
    case when research_flag = 1 then 'true' else 'false' end as research_flg,
    case when distrib_flag = 1 then 'true' else 'false' end as distrib_flg,
    case when ignore_red_dot_flag = 1 then 'true' else 'false' end as ignore_red_dot_flg,
    case when fragrance_flag = 1  then 'true' when fragrance_flag=0 then 'false' end as
 fragrance_flg,
    case when breeding_system = 'M' then 'male'
     when breeding_system = 'F' then 'female'
     when breeding_system = 'B' then 'both'
     when breeding_system = 'Q' then 'dioecious and unknown sex'
     when breeding_system = 'U' then 'unknown'
     end as breeding_system,
    case when flower_jan = 0 then 'No'
     when flower_jan = 1 then 'Some'
     when flower_jan = 2 then 'Many'
     end as flowers_jan,
    case when flower_feb = 0 then 'No'
     when flower_feb = 1 then 'Some'
     when flower_feb = 2 then 'Many'
     end as flowers_feb,
    case when flower_mar = 0 then 'No'
     when flower_mar = 1 then 'Some'
     when flower_mar = 2 then 'Many'
     end as flowers_mar,
    case when flower_apr = 0 then 'No'
     when flower_apr = 1 then 'Some'
     when flower_apr = 2 then 'Many'
     end as flowers_apr,
    case when flower_may = 0 then 'No'
     when flower_may = 1 then 'Some'
     when flower_may = 2 then 'Many'
     end as flowers_may,
    case when flower_jun = 0 then 'No'
     when flower_jun = 1 then 'Some'
     when flower_jun = 2 then 'Many'
     end as flowers_jun,
    case when flower_jul = 0 then 'No'
     when flower_jul = 1 then 'Some'
     when flower_jul = 2 then 'Many'
     end as flowers_jul,
    case when flower_aug = 0 then 'No'
     when flower_aug = 1 then 'Some'
     when flower_aug = 2 then 'Many'
     end as flowers_aug,
    case when flower_sep = 0 then 'No'
     when flower_sep = 1 then 'Some'
     when flower_sep = 2 then 'Many'
     end as flowers_sep,
    case when flower_oct = 0 then 'No'
     when flower_oct = 1 then 'Some'
     when flower_oct = 2 then 'Many'
     end as flowers_oct,
    case when flower_nov = 0 then 'No'
     when flower_nov = 1 then 'Some'
     when flower_nov = 2 then 'Many'
     end as flowers_nov,
    case when flower_dec = 0 then 'No'
     when flower_dec = 1 then 'Some'
     when flower_dec = 2 then 'Many'
     end as flowers_dec,
    case when fruit_jan=1 then 'true' else 'false' end as fruit_jan_flg,
    case when fruit_feb=1 then 'true' else 'false' end as fruit_feb_flg,
    case when fruit_mar=1 then 'true' else 'false' end as fruit_mar_flg,
    case when fruit_apr=1 then 'true' else 'false' end as fruit_apr_flg,
    case when fruit_may=1 then 'true' else 'false' end as fruit_may_flg,
    case when fruit_jun=1 then 'true' else 'false' end as fruit_jun_flg,
    case when fruit_jul=1 then 'true' else 'false' end as fruit_jul_flg,
    case when fruit_aug=1 then 'true' else 'false' end as fruit_aug_flg,
    case when fruit_sep=1 then 'true' else 'false' end as fruit_sep_flg,
    case when fruit_oct=1 then 'true' else 'false' end as fruit_oct_flg,
    case when fruit_nov=1 then 'true' else 'false' end as fruit_nov_flg,
    case when fruit_dec=1 then 'true' else 'false' end as fruit_dec_flg
FROM
    dbo.accession a
left outer join accession_notes an on (a.accession_number = an.accession_number)
left outer join collection_site cs on (a.accession_number = cs.accession_number)
left outer join dbo.accession_attr aa on (a.accession_number = aa.accession_number)
left outer join lexicon l on (a.lexicon_id = l.lexicon_id)
where a.accession_number <> '00.0000'
order by a.accession_number

Notes

Exclude from requestedBy:
- urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name(0)'Owner, Database'
Exclude from identBy:
- urn:cspace:botgarden.cspace.berkeley.edu:personauthorities:name(person):item:name(0)'Owner, Database'
- urn:cspace:botgarden.cspace.berkeley.edu:null:item:name(null)'null'

Talend/ETL

ucbgaccessions 0.5 in Chris's Talend repository, botgarden

Run Steps

Merge into accessions the fragments for:
- determinations
- collection sites
- accession use
Change "schema2" and "schema3" to "schema"
Import

MERGE: Taxonomic Determinations and Hybrid Parents (repeating group) - SQL ready to test (Chris), Talend dev-sample data ready (Chris)

It will be easier to extract Hybrid Parents at the same time as determinations in this case. Talend can handle a single repeating group. Data will be denormalized in SQL via unions and then grouped by Talend. This is complicated by the fact that some identifications come from the accession table; others come from plant_identity.

Field mapping

SQL

-- for all accessions: get plant_identity records for non-hybrids, one row per plant_identity record if no hybrid parent; multiple rows per accession, one with accepted_flag=1
SELECT
    rtrim(accession_number) objectnumber,
    lexicon_id,
    case when accepted_flag = 0 then 'false'
      else 'true' end as accepted_flg,
    case when aff_lexicon_id = -5 then null
      else aff_lexicon_id end as aff_lexicon,
    'false' hybrid_flg,
    date_entered,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else id_made_date end as id_made_date,
    case when id_made_by > 0 then id_made_by end as id_made_by,
    id_reference,
    id_modifier,
    id_remarks,
    null parent_id,
    null parent_qualifier,
    null fhp_lexicon_id,
    null mhp_lexicon_id,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(year, id_made_date) end as id_year,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(month, id_made_date) end as id_month,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(day, id_made_date) end as id_day
FROM
    dbo.plant_identity
where fhp_lexicon_id = -5 and fhp_lexicon_id = -5 and accession_number <> '00.0000'
union
--  for all accessions: get plant_identity records for hybrids, usually two rows per plant_identity record if any hybrid parent; multiple rows/pairs per accession with one pair having accepted flag=1.
SELECT
    rtrim(accession_number) objectnumber,
    case when lexicon_id = -5 then null
      else lexicon_id end as lexicon_id,
    case when accepted_flag = 0 then 'false'
      else 'true' end as accepted_flg,
    case when aff_lexicon_id = -5 then null
      else aff_lexicon_id end as aff_lexicon,
    'true' hybrid_flg,
    date_entered,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else id_made_date end as id_made_date,
    case when id_made_by > 0 then id_made_by end as id_made_by,
    id_reference,
    id_modifier,
    id_remarks,
    fhp_lexicon_id parent_id,
    'female' parent_qualifier,
    fhp_lexicon_id,
    mhp_lexicon_id,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(year, id_made_date) end as id_year,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(month, id_made_date) end as id_month,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(day, id_made_date) end as id_day
FROM
    dbo.plant_identity
where fhp_lexicon_id <> -5
union
-- get male hybrid parents from plant_identity
SELECT
    rtrim(accession_number) objectnumber,
    case when lexicon_id = -5 then null
      else lexicon_id end as lexicon_id,
    case when accepted_flag = 0 then 'false'
      else 'true' end as accepted_flg,
    case when aff_lexicon_id = -5 then null
      else aff_lexicon_id end as aff_lexicon,
    'true' hybrid_flg,
    date_entered,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else id_made_date end as id_made_date,
    case when id_made_by > 0 then id_made_by end as id_made_by,
    id_reference,
    id_modifier,
    id_remarks,
    mhp_lexicon_id parent_id,
    'male' parent_qualifier,
    fhp_lexicon_id,
    mhp_lexicon_id,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(year, id_made_date) end as id_year,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(month, id_made_date) end as id_month,
    case when id_made_date = '3000-01-01 00:00:00' then null
      else datepart(day, id_made_date) end as id_day
FROM
    dbo.plant_identity
where mhp_lexicon_id <> -5
union
-- for the 414 where there is no accepted record in plant_identity, get the accession record for non-hybrids, one row per accession record; set accepted flag=1
SELECT
    rtrim(accession_number) objectnumber,
    lexicon_id,
    'true' accepted_flg, -- assume these are accepted determinations though no record in plant_identity
    case when aff_lexicon_id = -5 then null
      else aff_lexicon_id end as aff_lexicon,
    case when hybrid_flag = 0 then 'false'
      else 'true' end as hybrid_flg,
    date_entered, -- this is accession.date_entered but probably is correct for this set
    null id_made_date,
    null id_made_by,
    null id_reference,
    null id_modifier,
    'determination from SAGE accession record' id_remarks,
    null parent_id,
    null parent_qualifier,
    null fhp_lexicon_id,
    null mhp_lexicon_id,
    null id_year,
    null id_month,
    null id_day
FROM
    dbo.accession
where fhp_lexicon_id = -5 and fhp_lexicon_id = -5
    and accession_number in (
    select a.accession_number
       from accession a
       left outer join plant_identity p on (a.accession_number = p.accession_number and p.accepted_flag =1)
       where p.accession_number is null
    )
union
-- for the 414 where there is no accepted record in plant_identity, get the accession record for hybrids, usually two rows per accession record; set accepted flag=1
SELECT
    rtrim(accession_number) objectnumber,
    lexicon_id,
    'true' accepted_flg, -- assume these are accepted determinations though no record in plant_identity
    case when aff_lexicon_id = -5 then null
      else aff_lexicon_id end as aff_lexicon,
    case when hybrid_flag = 0 then 'false'
      else 'true' end as hybrid_flg,
    date_entered, -- this is accession.date_entered but probably is correct for this set
    null id_made_date,
    null id_made_by,
    null id_reference,
    null id_modifier,
    'determination from SAGE accession record' id_remarks,
    fhp_lexicon_id parent_id,
    'female' parent_qualifier,
    fhp_lexicon_id,
    mhp_lexicon_id,
    null id_year,
    null id_month,
    null id_day
FROM
    dbo.accession
where fhp_lexicon_id <> -5
    and accession_number in (
     select a.accession_number
       from accession a
       left outer join plant_identity p on (a.accession_number = p.accession_number and p.accepted_flag =1)
       where p.accession_number is null
)
union
-- get male hybrid parents for the query above
SELECT
    rtrim(accession_number) objectnumber,
    lexicon_id,
    'true' accepted_flg, -- assume these are accepted determinations though no record in plant_identity
    case when aff_lexicon_id = -5 then null
      else aff_lexicon_id end as aff_lexicon,
    case when hybrid_flag = 0 then 'false'
      else 'true' end as hybrid_flg,
    date_entered, -- this is accession.date_entered but probably is correct for this set
    null id_made_date,
    null id_made_by,
    null id_reference,
    null id_modifier,
    'determination from SAGE accession record' id_remarks,
    mhp_lexicon_id parent_id,
    'male' parent_qualifier,
    fhp_lexicon_id,
    mhp_lexicon_id,
    null id_year,
    null id_month,
    null id_day
FROM
    dbo.accession
where mhp_lexicon_id <> -5
    and accession_number in (
     select a.accession_number
       from accession a
       left outer join plant_identity p on (a.accession_number = p.accession_number and p.accepted_flag =1)
       where p.accession_number is null
)
order by objectnumber, accepted_flg desc, date_entered desc, parent_qualifier

Notes

So this is now determinations plus hybrids in one big denormalized set. Psuedo-SQL:

1 for all accessions (including the 418 that have no accepted record in plant_identity): get plant_identity records where FHP = -5 and MHP = -5 (one row per plant_identity record if no hybrid parent; multiple rows per accession, usually one with accepted_flag=1; order by accession_number, accepted_flag desc)
union
2 for all accessions (including the 418 that have no accepted record in plant_identity): get plant_identity records where FHP <> -5 or MHP <> -5 (usually two rows per plant_identity record if any hybrid parent; multiple rows/pairs per accession with one pair having accepted flag =1. Order by accession_number, accepted_flag desc, parent_type female before male)
union
3. for the 418 accessions that have no accepted record in plant_identity, get the accession record where FHP = -5 and MHP = -5 (one row per accession record if no hybrid parent; set accepted flag=1? or not; maybe indicate this is a record from accession table; no accepted record in plant_identity)
union
4. for the 418 accessions that have no accepted record in plant_identity, get the accession records where FHP <> -5 or MHP <> -5 (usually two rows per accession record if any hybrid parent; order by accession_number, parent_type female before male; set accepted flag=1)

There are 12 non-preferred terms used throughout this set of 51K determinations. These need a different refname than is stored in the database. Rather than code that into the Talend job, I will just edit the merge file by hand to edit in the non-preferred term's refname.

Talend/ETL

Chris ucbgdeterminations 0.5 in Chris's Talend repository botgarden

MERGE: Collection Sites (not repeating but easier to extract separately) - SQL dev (Chris), Talend dev-sample data ready (Chris)

Including significant cleanup of geography data from SAGE.

Field mapping

These are the rules we are using for cleaning up the geographic data and collection sites in SAGE:

For accessions from years 2000 and later:

If Collection Site = 'Earth' or Hort, and Accession Notes has parentheses
- Populate the new Taxonomic Range field (in Collection Object, Locality Info) with the Accession Notes info in parentheses (Taxonomic Range data come from accession table, parenthetical text in accession.accession_notes)
- Field collection place is blank
- Field collection county, Field collection state, Field collection country are blank

If Collection Site = some known non-Earth, non-Hort location
- Taxonomic range information is blank
- Field collection place gets the Collection Site value (which will be controlled by the Place Authority) (Field Collection Place data come from collection_site table, via geog_lexicon.geog_term)
- Field collection county, state, and country will be blank in the cataloging record but will be available via the Place Authority by importing hierarchical Place Authority records.

For accessions from years 1999 and earlier:

If any of the county, state, country fields has parentheses, use data from
- Taxonomic Range data come from Original Accession Geography, the accession_loc table (accession_loc.country_id, accession_loc.state_id, accession_loc.county_id). Strip parentheses, concatenate, and put into the new Taxonomic range field.
- Field collection place is blank
- Field collection county, Field collection state, Field collection country are blank

If all three do not have parentheses, and collection site is not Earth or Hort., use data from Collection Site
- Taxonomic range field is blank
- Field Collection Place data come from collection_site table (geog_lexicon.geog_term) Collection Site
- Field collection county, state, and country will be blank in the cataloging record but will be available via the Place Authority by importing hierarchical Place Authority records.

If all three do not have parentheses; collection site = Earth or Hort.; and there are some legitimate non-range values in country, state, or county
- Taxonomic range field is blank
- Field collection place is blank
- Field collection county, Field collection state, Field collection country get the values from Original Accession Geography (accession_loc)

Note: If collection site = 'hort.' or if country = 'HORT.' then the cultivated checkbox should be checked, and provenance drop down should be "not wild source".

SQL

SELECT
    rtrim(cs.accession_number) objectnumber,
    case when lat_deg is not null then
     convert(varchar,convert(int, lat_deg))||' '||convert(varchar,convert(int,
  lat_min))||' '||convert(varchar,convert(int, lat_sec))||lat_dir
    end as vlatitude,
    case when long_deg is not null then
      convert(varchar,convert(int, long_deg))||' '||convert(varchar,convert(int,
  long_min))||' '||convert(varchar,convert(int, long_sec))||long_dir
    end as vlongitude,
    latitude declatitude,
    longitude declongitude,
    case when UTM_north is not null then
      convert(varchar,UTM_east)||' mE, '||convert(varchar,UTM_north)||' mN'
    end as vcoords,
    elevation velevation,
    lower_elev minelevation,
    upper_elev maxelevation,
    case when lower_elev is not null then 'meters' end as unitselevation,
    case
      when geog_detail is not null and geog_distance is null then geog_detail
      when geog_detail is null and geog_distance is not null then geog_distance
      when geog_detail is not null and geog_distance is not null then geog_detail||';
 '||geog_distance
    end as vlocation,
    habitat_descr,
    veg_type,
    use_range_flag,
    case
      when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
 '%(%)%' and cs.geog_lex_id <> 0) then cs.geog_lex_id
      when (cs.geog_lex_id in (2031, 2326)) then 2326
    end as geog_lex,
    case when gl.geog_term = 'hort' then 'hort.'
       else gl.geog_term
    end as geog_term,
    case
      when (country_name like '%(%' or state_name like '%(%)%' or county_name like '%(%)%')
        then rtrim(county_name||', '||state_name||', '||country_name)
    end as taxonomic_range,
    case
      when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
 '%(%)%' and cs.geog_lex_id in (0) and (al.country_id not in (0, 442,
  562) or al.state_id <> 0 or al.county_id <> 0)) then country_name
    end as new_country,
    case
      when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
 '%(%)%' and cs.geog_lex_id in (0) and (al.country_id not in (0, 442,
  562) or al.state_id <> 0 or al.county_id <> 0) and al.state_id <> 0) then state_name
    end as new_state,
    case
      when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
 '%(%)%' and (cs.geog_lex_id < 5953) and (al.country_id not in (0, 442,
  562) or al.state_id <> 0 or al.county_id <> 0) and al.county_id not in (0, 14) ) then county_name
    end as new_county
FROM
    collection_site cs
left outer join dbo.accession_loc al on (al.accession_number = cs.accession_number)
left outer join country_code cc on (al.country_id = cc.country_id)
left outer join state_code sc on (al.state_id = sc.state_id)
left outer join county_code c on (al.county_id = c.county_id)
left outer join geog_lexicon gl on (cs.geog_lex_id = gl.geog_lex_id)
where substring(cs.accession_number, 3, 1) = '.' and cs.accession_number <> '00.0000'
union
SELECT
    rtrim(a.accession_number) objectnumber,
    case when lat_deg is not null then
     convert(varchar,convert(int, lat_deg))||' '||convert(varchar,convert(int,
  lat_min))||' '||convert(varchar,convert(int, lat_sec))||lat_dir
    end as vlatitude,
    case when long_deg is not null then
      convert(varchar,convert(int, long_deg))||' '||convert(varchar,convert(int,
  long_min))||' '||convert(varchar,convert(int, long_sec))||long_dir
    end as vlongitude,
    latitude declatitude,
    longitude declongitude,
    case when UTM_north is not null then
      convert(varchar,UTM_east)||' mE, '||convert(varchar,UTM_north)||' mN'
    end as vcoords,
    elevation velevation,
    lower_elev minelevation,
    upper_elev maxelevation,
    case when lower_elev is not null then 'meters' end as unitselevation,
    case
      when geog_detail is not null and geog_distance is null then geog_detail
      when geog_detail is null and geog_distance is not null then geog_distance
      when geog_detail is not null and geog_distance is not null then geog_detail||';
 '||geog_distance
    end as vlocation,
    habitat_descr,
    veg_type,
    use_range_flag,
    case
      when (cs.geog_lex_id not in (0, 2031) ) then cs.geog_lex_id
      when (cs.geog_lex_id = 2031 ) then 2326
    end as geog_lex,
    case when gl.geog_term = 'hort' then 'hort.'
       else gl.geog_term
    end as geog_term,
    case
      when (cs.geog_lex_id in (0, 2031, 2326) and accession_notes like '%(%)%')
        then substring(
        substring(
                a.accession_notes,
                (patindex('%(%)%', a.accession_notes)) + 1,
                char_length(a.accession_notes)),
        1,
        (patindex('%)%', substring(
                a.accession_notes,
                (patindex('%(%)%', a.accession_notes)) + 1,
                char_length(a.accession_notes)))) - 1
    )
    end as taxonomic_range,
    case
      when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
 '%(%)%' and cs.geog_lex_id =0 and (al.country_id not in (0, 442,
  562) or al.state_id <> 0 or al.county_id <> 0)) then country_name
    end as new_country,
    case
      when (country_name not like '%(%' and state_name not like '%(%)%' and county_name not like
 '%(%)%' and cs.geog_lex_id =0 and (al.country_id not in (0, 442,
  562) or al.state_id <> 0 or al.county_id <> 0) and al.state_id <> 0) then state_name
    end as new_state,
    case
      when (county_name not like '%(%)%' and county_name is not null and al.county_id <> 0 and
 cs.geog_lex_id < 5953) then county_name
    end as new_county
FROM
    dbo.accession a
left outer join accession_loc al on (a.accession_number = al.accession_number)
left outer join collection_site cs on (a.accession_number = cs.accession_number)
left outer join geog_lexicon gl on (cs.geog_lex_id = gl.geog_lex_id)
left outer join country_code cc on (al.country_id = cc.country_id)
left outer join state_code sc on (al.state_id = sc.state_id)
left outer join county_code c on (al.county_id = c.county_id)
where substring(a.accession_number, 5, 1) = '.'
order by objectnumber

Notes

Sites can have hort and range..

County, state, country only used in a small handful of cases where they provided useful information not in collection site or taxonomic range. Concatenated onto verbatim locality field.

Talend/ETL

ucbgcollectionsite 0.6 in Chris's Talend repository botgarden

MERGE: Use of Accessions (repeating group) - SQL ready (Chris), Talend dev-sample data ready (Chris)

Research projects and classes

Field mapping

SQL

SELECT
    rtrim(accession_number) accession_number,
    'research' usetype,
    'researchproject'||convert(varchar(20), rb.project_id) conceptShortID,
    rp.project_name conceptName,
    request_date,
    filled_date,
    material_type,
    notes
FROM
    dbo.research_plant rb
join research_project rp on (rp.project_id = rb.project_id)
union
SELECT
    rtrim(accession_number) accession_number,
    'class' usetype,
    'class'||convert(varchar(20), cp.course_id) conceptShortID,
    cu.course_title conceptName,
    cp.last_change_date request_date,
    null filled_date,
    null material_type,
    concept notes
FROM
    dbo.class_plant cp
join class_use cu on (cp.course_id=cu.course_id)
order by accession_number, request_date

Notes

URN format for materialType: <materialType>urn:cspace:botgarden.cspace.berkeley.edu:vocabularies:name(materialtype):item:name(bulb)'bulb'</materialType>

Talend/ETL

ucbgaccessionuse 0.1 in Chris's Talend repository botgarden

Procedures

Plant Locations (LMI procedure, including current location and history/version) and Relationships to Objects - SQL ready (Chris), Talend (Yuteh)

See attached accessions map, current_location and accession_history tables

Field mapping

SQL

Initially, this SQL just gets the current location records, plant_location table.

SELECT
    null record_id,
    rtrim(accession_number) accession_number,
    'plant_location' datasource,
    action_date,
    case when action_code = 0 then 'Dead'
     when action_code = 1 then 'Planted Out'
     when action_code = 2 then 'Moved'
     when action_code = 3 then 'Label Request'
     when action_code = 4 then 'Theft'
     when action_code = 5 then 'Other'
     when action_code = 6 then 'Revived'
    end as reasonForMove,
    null previous_location,
    'garden'||convert(varchar(5), garden_loc_id) current_location,
    case when label_req_flag = 1 then 'true' else 'false' end as label_request,
    label_size,
    stand_type,
    label_count,
    comments,
    entered_staff_id,
    date_entered,
    last_change_staff_id,
    last_change_date
FROM
    dbo.plant_location
where accession_number <> '00.0000'
order by accession_number, datasource desc, action_date desc

Notes

Accession History records are being loaded as versioned LMI records. We still need to know how to do that.

Talend/ETL

Uses output from ucbgaccessions Talend job that contains CSIDs and Object Numbers. Rename most recent output file to ucbg-accessions-ids.txt

Also uses a file sage_createupdateby_lookup.txt that contains a few user email addresses to decode the staff ID for the createdBy and updateBy info

ucbgplantlocations 0.1

Run steps

Movements: Change "schema2" to "schema"
Movements: Import
Move-to-object relationships: Import
Object-to-move relationships: Import

Vouchers and Voucher-Object relationships - SQL ready (Chris), Talend dev (Lam)

Field mapping

Using Loan Out for Voucher

SQL

SELECT
    voucher_id, -- loanOutNumber
    accession_number, -- for relationship to Accession
    voucher_number voucherInstitution, -- get refname for borrower
    case when h_w_code='h' then 'Horticultural'
         when h_w_code='w' then 'Wild'
    end as hortWild, -- decode to h Horticultural or w Wild
    vouchered_date,
    case when voucher_early_date>2415021 and voucher_early_date<2556233 and vouchered_date not like ('%.%')
          then dateadd(dd,(voucher_early_date - 2415021), convert(date, '1900-01-01', 103))
    end as voucherDate, -- loanOutDate
    voucher_early_date, -- ??
    voucher_late_date, -- ??
    case when vouchered_by <> 0 then vouchered_by end as vouchered_by,
    case when garden_loc_id <> 0 then garden_loc_id end as garden_loc_id, -- get refname for garden loc
    label_count numSheets, -- numLent
    smasch_flag, -- skip
    case when flowering_flag=1 then 'true' else 'false' end as flowering, -- case 1=true
    case when fruiting_flag=1 then 'true' else 'false' end as fruiting, -- case 1=true
    case when sterile_flag=1 then 'true' else 'false' end as sterile, -- case 1=true
    entered_staff_id createdBy,
    date_entered createdAt,
    last_change_staff_id updatedBy,
    last_change_date updatedAt
FROM
    dbo.voucher

Notes

Still need to get UCJEPS accession number for some number of vouchers

Uses ID output from job that creates garden locations vocabulary. Make sure file is renamed to ucbg-gardenloc-ids.txt
Uses ID output from job that creates accessions. Make sure that file is renamed to ucbg-accessions-ids.txt

Talend/ETL

ucbgvoucher 0.1

Run steps

Vouchers: Change "schema2" and "schema3" to "schema"
Vouchers: Import
Voucher-to-object relationships: Import
Object-to-voucher relationships: Import

Propagation SQL ready (Chris), REST Import development (John)

Field mapping

SQL

Control lists:

Pot Type: select pot_type from pot_type_code order by pot_type;
Pot Size: select pot_size from pot_size_code order by pot_size;
Treatment Type: select treatment_type_name from treatment_type order by treatment_type_name;

Notes

See Botanical propagation data analysis

Prop Type:

lookup table: prop_type_code
propagation.prop_type = pro_type_code.prop_type
21 values in prop_type_code

Reason for Prop: propagation.purpose

database has a rule that limits the values to: 'class use', 'research', 'garden collection',
'distribution', 'conservation'
5 distinct values in propagation.purpose

Extra Seeds: propagation.extra_seed_flag

boolean (0,1)

Activity Type:

lookup table: activity_type_code
activity_type_code.activity_type where growth_history.activity_type_id = activity_type_code.activity_type_id
24 values in activity_type_code

Medium:

lookup table: medium
growth_history.medium_name = medium.medium_name
1302 values in medium

Pot Type:d

lookup table: pot_type
growth_history.pot_type = pot_type_code.pot_type
35 values in pot_type_code

Pot Size:

lookup table: pot_size_code
pot_size_code.pot_size where growth_history.pot_size_id = pot_size_code.pot_size_id
69 values in pot_size_code

Treatment Type:

lookup table: treatment_type
treatment_type.treatment_type_name where treatment.treatment_type_id = treatment_type.treatment_type_id
91 values in treatment_type

Concentration: treatment.concentration

varchar(10) field
no rules or constraints on this field
80 distinct values in treatment.concentration

Talend/ETL

MERGE: Propagation Activities (repeating group) SQL ready (Chris), REST import (John)

Field mapping

SQL

Notes

See Botanical propagation data analysis

Relationships between Accessions and Propagations; REST import (John)

Field mapping

SQL

Notes

Talend/ETL

Distributions and Distribution-Accession relationships (using Object Exit procedure) - SQL dev (Chris), Talend dev (Chris), Data botgarden-dev loaded (Chris)

Field mapping

SQL

SELECT
    rtrim(accession_number) accession_number, -- for relations
    e.org_id, -- Exchange Org, get refname
    o.org_name,
    convert(date, exchange_date) exchangeDate, -- for display date
    exchange_date, -- for scalar dates
    datepart(year, exchange_date) as exchange_date_year,
    datepart(month, exchange_date) as exchange_date_month,
    datepart(day, exchange_date) as exchange_date_day,
    plant_count,
    e.entered_staff_id,
    e.date_entered,
    exchange_comments
FROM
    dbo.exchange e,
    organization o
    where e.org_id=o.org_id

Notes

Uses ID output from job that creates accessions. Make sure that file is renamed to ucbg-accessions-ids.txt

Talend/ETL

ucbgdistributions 0.1

Run steps

Distributions: Change "schema2" to "schema"
Distributions: Import
Distribution-to-object relationships: Import
Object-to-distribution relationships: Import

Pot Tags (Volunteer Labels) - Lam (SQL), Chris (Talend)

Field mapping

SQL

select
    vl.label_id as legacyId,
    vl.accession_number as accessionNumber,
    vl.formatted_taxon as taxonName,
    vl.common_name as commonName,
    vl.label_data as labelData,
    vl.family as family,
    vl.country_name as locale,
    vl.quantity as numberOfLabels,
    case when vl.label_req_flag = 0 then 'no'
         when vl.label_req_flag = 1 then 'yes'
    end as printLabels
from
    volunteer_label vl

Notes

Talend/ETL

ucbgpottags 0.1 creates pot tag records and relationship records to collection objects