CineFiles Data Mapping
Notes, queries, and mapping documents related to the migration of CineFiles data into CollectionSpace will be documented here.
Data mapping: See the Data Mapping Google Doc for field-based mapping
BAMPFA-1: Jira task for ER diagrams, spreadsheets for field mapping, etc.
Initial observations based on team discussion on 20 June 2013
Authorities
Citations - Authority Source vocabulary - SQL: Chris, Talend: Chris
Create new citation vocabulary to hold authority source terms
Field mapping
SQL
Using a CSV file exported from dbVisualizer as the data source (14 records)
Notes
Talend/ETL
Person - default person vocabulary - SQL: Glen, Talend: Glen
We will have one person vocabulary, mapping the names from Sybase that are person records.
There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.
Field mapping
SQL
SELECT name_id, ( convert( varchar(15), names.name_id), '') as shortIdentifier, ISNULL( fname, '' ) as foreName, ISNULL( mname, '' ) as middleName, lname as surName, ISNULL( gen_suffix,'' ) as nameAddition, fname + substring(' ', sign(datalength(fname)), 1) + mname + substring(' ', sign(datalength(mname)), 1) + lname + substring(' ', sign(datalength(gen_suffix)), 1) + gen_suffix as displayName, ISNULL(convert( char(4), datepart( year, dob )), '' ) as birthYear, dob as earliestScalarBirthDate, dateadd( second, -1, dateadd( year, 1, dob )) as latestScalarBirthDate, ISNULL(convert( char(4), datepart( year, dod )), '' ) as deathYear, dod as earliestScalarDeathDate, dateadd( second, -1, dateadd( year, 1, dod )) as latestScalarDeathhDate, ISNULL( city, '' ) as city, ISNULL( state, '' ) as state, cntry_id, ISNULL( biog, '' ) as bioNote, name_type, ISNULL( convert( char(1), code ), '' ) as accessCode, entered, modified, ISNULL( entered_by, '' ) as enteredBy, ISNULL( modified_by, '' ) as modifiedBy, ISNULL( verified_by, '' ) as verifiedBy, ISNULL( src_id, '' ) as authSource, ISNULL( note, '' ) as nameNote, ISNULL( url, '' ) as URL FROM names WHERE name_type = 1 AND name_id > 1
Notes
Used ISNULL function everywhere that it made sense to avoid the endless checking for nulls in Talend.
Aliased files names to match the import tags they will be mapped to.
Name recort 1 is a dummy record so it is excluded.
Note: Access Code mapping. These are the option values mapped in to CollectionSpace:
0 <option id="PFA Staff Only">PFA Staff Only</option>
1 <option id="In House Only">In House Only</option>
2 <option id="Campus (UCB)">Campus (UCB)</option>
3 <option id="Education (.edu)">Education (.edu)</option>
4 <option id="World">World</option>
5 World
Note that 5 exists in the database as a convenience but basically also maps to World. See http://issues.collectionspace.org/browse/BAMPFA-36 for more information.
Talend/ETL
MERGE: Person names - SQL: Glen, Talend: Glen
In order to incorporate primary and alternate names.
Field mapping
SQL
SELECT name_id, 0 as altname_id, ISNULL( convert( varchar( 15 ), name_id ), '' ) as shortIdentifier, ISNULL( fname, '' ) as foreName, ISNULL( mname, '' ) as middleName, lname as surName, ISNULL( gen_suffix, '' ) as nameAddition, fname + substring( ' ', sign( datalength( fname ) ), 1 ) + mname + substring( ' ', sign( datalength( mname ) ), 1 ) + lname + substring( ' ', sign( datalength( gen_suffix ) ), 1 ) + gen_suffix as displayName, '' as srcID, '' as nameNote FROM names WHERE name_type = 1 AND name_id > 1 UNION SELECT a.name_id, a.altname_id, ISNULL( convert( varchar( 15 ), a.name_id ), '' ) as shortIdentifier, ISNULL( a.fname, '' ) as foreName, ISNULL( a.mname, '' ) as middleName, a.lname as surName, ISNULL( a.gen_suffix, '' ) as nameAddition, a.fname + substring( ' ', sign( datalength( a.fname ) ), 1 ) + a.mname + substring( ' ', sign( datalength( a.mname ) ), 1 ) + a.lname + substring( ' ', sign( datalength( a.gen_suffix ) ), 1 ) + a.gen_suffix as displayName, ISNULL( a.src_id, '' ) as srcID, ISNULL( a.note, '' ) as nameNote FROM altnames a, names n HAVING ( n.name_type = 1 AND n.name_id = a.name_id AND n.name_id > 1 ) ORDER BY name_id, altname_id
Notes
Includes preferred and non-preferred names in one query. A '0' value for altname_id indicates a preferred name.
Talend/ETL
Organization - corporate names vocabulary - SQL: Glen, Talend: Glen
Use the default org vocabulary for Sybase corporate names
There might be some need to produce merge parts, for instance if we need to create, preferred and non-preferred terms for individual records.
Field mapping
SQL
SELECT name_id, ISNULL( convert( varchar(15), names.name_id), '') as shortIdentifier, lname as displayName, ISNULL( city, '' ) as foundingCity, ISNULL( state, '' ) as foundingState, cntry_id, ISNULL( convert( char(4), datepart( year, dob )), '' ) as foundingYear, ISNULL( dob, NULL ) as earliestScalarBirthDate, dateadd( second, -1, dateadd( year, 1, dob )) as latestScalarBirthDate, ISNULL( biog, '' ) as bioNote, ISNULL( note, '' ) as nameNote, ISNULL( src_id, '' ) as authSourceID, ISNULL( convert( char(1), code ), '' ) as accessCode, ISNULL( url, '' ) as URL, entered, modified, ISNULL( entered_by, '' ) as enteredBy, ISNULL( modified_by, '' ) as modifiedBy FROM names WHERE name_type = 2 AND name_id > 1 ORDER BY name_id
Notes
Talend/ETL
Organization - committees vocabulary - SQL: Chris, Talend: Chris
Create a new org vocabulary for Sybase committee names
Field mapping
SQL
SELECT name_id, lname, name_type, code, src_id, note, entered, entered_by, modified, modified_by FROM names where name_type=3 order by lname
Notes
Talend/ETL
MERGE: Corporate names - SQL: Glen, Talend: Glen
In order to incorporate primary and alternate names.
Field mapping
SQL
SELECT name_id, 0 as altname_id, ISNULL( convert( varchar( 15 ), name_id ), '' ) as shortIdentifier, ISNULL( fname, '' ) as foreName, ISNULL( mname, '' ) as middleName, lname as surName, ISNULL( gen_suffix, '' ) as nameAddition, fname + substring( ' ', sign( datalength( fname ) ), 1 ) + mname + substring( ' ', sign( datalength( mname ) ), 1 ) + lname + substring( ' ', sign( datalength( gen_suffix ) ), 1 ) + gen_suffix as displayName, '' as srcID, '' as nameNote FROM names WHERE name_type = 1 AND name_id > 1 UNION SELECT a.name_id, a.altname_id, ISNULL( convert( varchar( 15 ), a.name_id ), '' ) as shortIdentifier, ISNULL( a.fname, '' ) as foreName, ISNULL( a.mname, '' ) as middleName, a.lname as surName, ISNULL( a.gen_suffix, '' ) as nameAddition, a.fname + substring( ' ', sign( datalength( a.fname ) ), 1 ) + a.mname + substring( ' ', sign( datalength( a.mname ) ), 1 ) + a.lname + substring( ' ', sign( datalength( a.gen_suffix ) ), 1 ) + a.gen_suffix as displayName, ISNULL( a.src_id, '' ) as srcID, ISNULL( a.note, '' ) as nameNote FROM altnames a, names n HAVING ( n.name_type = 1 AND n.name_id = a.name_id AND n.name_id > 1 ) ORDER BY name_id, altname_id
Notes
Talend/ETL
MERGE: Committee Members for committees (org vocabulary) - SQL: Chris, Talend: Chris
Field mapping
SQL
SELECT cm.name_id committeeId, rtrim(cm.src_id) src_id, p.fname + substring(' ', sign(datalength(p.fname)), 1) + p.mname + substring(' ', sign(datalength(p.mname)), 1) + p.lname + substring(' ', sign(datalength(p.gen_suffix)), 1) + p.gen_suffix as member_displayName, ISNULL( convert( varchar(15), p.name_id), '') as member_shortIdentifier FROM committeemembers cm, names c, names p where cm.name_id=c.name_id and cm.member_id=p.name_id and c.name_type=3 and cm.member_id<>1 order by c.name_id, member_displayName
Notes
Talend/ETL
Contact schema records for Organization - corporate names vocabulary - SQL: Glen, Talend: Glen
We do have URLs for approximately 200 corporate name records. Those are imported as a separate job with a pointer to the org record that the contact info should be associated with. See http://wiki.collectionspace.org/display/collectionspace/Organization+Service+REST+APIs#OrganizationServiceRESTAPIs-ContactCRUDLservices for more information.
Field mapping
SQL
...
Notes
Talend/ETL
Concept - genre vocabulary - SQL: Chris, Talend: Chris
Use default concept vocabulary to hold genres
Field mapping
SQL
SELECT g.genre_id, 'genre'||convert(varchar, genre_id) shortid, g.genre, rtrim(src_id) authsource, rtrim(note) note FROM genres g where genre_id <> 1 order by genre
Notes
Talend/ETL
Concept - subject vocabulary - SQL: Chris, Talend: Chris
Create new concept vocabulary to hold subject terms
Field mapping
SQL
SELECT subj_id, 'subject'||convert(varchar, subj_id) shortid, subj, rtrim(src_id), note FROM subjects where subj_id <> 1 order by subje
Notes
Talend/ETL
MERGE: Alternate genre terms - SQL: Chris, Talend: Chris
Field mapping
SQL
SELECT g.genre_id, 'genre'||convert(varchar, genre_id) shortid, g.genre, rtrim(src_id) authsource, rtrim(note) note, 0 sortid FROM genres g where genre_id <> 1 union select gl.genre_id, 'genre'||convert(varchar, genre_id) shortid, gl.genre_term, null authsource, null note, gl.altgenre_id sortid from genres_list gl where altgenre_id <> 0 order by genre_id, sortid
Notes
Talend/ETL
Works - Film Vocabulary - SQL: Lam, Talend: Lam
Use default Works vocabulary to hold films
Merge groups will be needed
Field mapping
SQL
select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    substring(        substring(            f.prefix + ' ', 1,            datalength( f.prefix + ' ' ) -            sign( patindex( '%[''-]', f.prefix ))),            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title as display_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(worktype):item:name(film)''Film''' as work_type,    f.note as history_notes,    f.entered as created_at,    case        when f.entered_by = 'lt' then 'Linda Tadic'        else            eb.fname + substring( ' ', sign( datalength( eb.fname ) ), 1 )            + eb.mname + substring( ' ', sign( datalength( eb.mname ) ), 1 )            + eb.lname    end as created_by,    f.modified as updated_at,    case        when f.modified_by = 'lt' then 'Linda Tadic'        else            mb.fname + substring( ' ', sign( datalength( mb.fname ) ), 1 )            + mb.mname + substring( ' ', sign( datalength( mb.mname ) ), 1 )            + mb.lname    end as updated_by,    'urn:cspace:cinefiles.cspace.berkeley.edu:workauthorities:name(work):item:name(pfafilm'        + convert(varchar(10), f.film_id) + ')'''        + substring(        substring(            f.prefix + ' ', 1,            datalength( f.prefix + ' ' ) -            sign( patindex( '%[''-]', f.prefix ))),            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title        + '''' as refname from films f, pfastaff eb, pfastaff mb where f.entered_by *= eb.initials and f.modified_by *= mb.initials and f.film_id > 1 order by legacy_film_id;
Notes
Talend/ETL
MERGE: Work Term Names (Film Titles) - SQL: Lam, Talend: Lam
Order: Original titles, English Titles, Alternate Titles
Encountered problem with $ in refname. Four records that contain the $ character in the title failed to load. Glen's investigation pinpointed the problem to the $ character in the refname. The occurrence of the $ in other fields seem to be okay. Glen substituted '$' for the $ character, and loading still failed. Removing the $ from the refname, but leaving the $ in other fields resulted in a successful load.
Field mapping
SQL
-- film term names -- from original titles from films table -- union English titles from films table -- union alternate titles from alttitles table -- sort order: original, English, alternate select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    case when f.prefix is null or f.prefix = '' then f.title        else substring(            substring(                f.prefix + ' ', 1,                datalength( f.prefix + ' ' ) -                sign( patindex( '%[''-]', f.prefix ))),                sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title    end as term_display_name,    ltrim(rtrim(f.title)) as term_name,    ltrim(rtrim(f.prefix)) as term_qualifier,    'Original Title' as term_type,    'complete' as term_status,    null as term_language,    1 as preferred_flag,    1 as pos,    ltrim(rtrim(f.src_id)) as term_source_id,    'urn:cspace:cinefiles.cspace.berkeley.edu:citationauthorities:name(citation):item:name('        + f.src_id + ')''' + a.source + '''' as term_source_refname,    null as term_source_note from    films f,    authsources a where f.src_id *= a.src_id and f.film_id > 1 union select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    case when f.eng_prefix is null or f.eng_prefix = '' then f.eng_title        else substring(            substring(                f.eng_prefix + ' ', 1,                datalength( f.eng_prefix + ' ' ) -                sign( patindex( '%[''-]', f.eng_prefix ))),                sign( patindex( '%[^^I ]%', f.eng_prefix )), 10 ) + f.eng_title    end as term_display_name,    ltrim(rtrim(f.eng_title)) as term_name,    ltrim(rtrim(f.eng_prefix)) as term_qualifier,    'English Title' as term_type,    'complete' as term_status,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(languages):item:name(ENG)''English''' as term_language,    0 as preferred_flag,    2 as pos,    ltrim(rtrim(f.src_id)) as term_source_id,    'urn:cspace:cinefiles.cspace.berkeley.edu:citationauthorities:name(citation):item:name('        + f.src_id + ')''' + a.source + '''' as term_source_refname,    null as term_source_note from    films f,    authsources a where f.src_id *= a.src_id and f.eng_title is not null and f.eng_title != '' and f.film_id > 1 union select    a.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), a.film_id) as short_identifier,    case when a.prefix is null or a.prefix = '' then a.title        else substring(            substring(                a.prefix + ' ', 1,                datalength( a.prefix + ' ' ) -                sign( patindex( '%[''-]', a.prefix ))),                sign( patindex( '%[^^I ]%', a.prefix )), 10 ) + a.title    end as term_display_name,    ltrim(rtrim(a.title)) as term_name,    ltrim(rtrim(a.prefix)) as term_qualifier,    'Alternate Title' as term_type,    'complete' as term_status,    null as term_language,    0 as preferred_flag,    3 as pos,    ltrim(rtrim(a.src_id)) as term_source_id,    'urn:cspace:cinefiles.cspace.berkeley.edu:citationauthorities:name(citation):item:name('        + a.src_id + ')''' + auth.source + '''' as term_source_refname,    a.note as term_source_note from    alttitles a,    authsources auth where a.src_id *= auth.src_id and a.title is not null and a.title != '' and a.film_id > 1 and a.alttitle_id not in (    select alt.alttitle_id from alttitles alt, films f    where substring(        substring(            alt.prefix + ' ', 1,            datalength( alt.prefix + ' ' ) -            sign( patindex( '%[''-]', alt.prefix ))),            sign( patindex( '%[^^I ]%', alt.prefix )), 10 ) + alt.title    = substring(        substring(            f.prefix + ' ', 1,            datalength( f.prefix + ' ' ) -            sign( patindex( '%[''-]', f.prefix ))),            sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title    and alt.film_id = f.film_id) and a.alttitle_id not in (    select alt.alttitle_id from alttitles alt, films f    where substring(        substring(            alt.prefix + ' ', 1,            datalength( alt.prefix + ' ' ) -            sign( patindex( '%[''-]', alt.prefix ))),            sign( patindex( '%[^^I ]%', alt.prefix )), 10 ) + alt.title    = substring(        substring(            f.eng_prefix + ' ', 1,            datalength( f.eng_prefix + ' ' ) -            sign( patindex( '%[''-]', f.eng_prefix ))),            sign( patindex( '%[^^I ]%', f.eng_prefix )), 10 ) + f.eng_title    and alt.film_id = f.film_id) order by legacy_film_id, pos;
Notes
Talend/ETL
MERGE: Work Creators (Film Directors) - SQL: Lam, Talend: Lam
Creators = Directors
Field mapping
SQL
select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    f.name_id as legacy_name_id,    ltrim(rtrim(case        when n.name_type = 1 then            n.fname + substring(' ', sign(datalength(n.fname)), 1) +            n.mname + substring(' ', sign(datalength(n.mname)), 1) +            n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix        when n.name_type = 2 then n.lname        when n.name_type = 3 then n.lname    end)) as creator_display_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(workcreatortype):item:name(director)''Director''' as creator_type,    case        when n.name_type = 1 then            'urn:cspace:cinefiles.cspace.berkeley.edu:personauthorities:name(person):item:name(person'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.fname + substring(' ', sign(datalength(n.fname)), 1)                + n.mname + substring(' ', sign(datalength(n.mname)), 1)                + n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix))            + ''''        when n.name_type = 2 then            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(committee):item:name(committee'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.lname)) + ''''        when n.name_type = 3 then            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(corporation):item:name(organization'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.lname)) + ''''    end as creator_refname from    films f,    names n where f.name_id *= n.name_id and f.film_id > 1 and f.name_id > 1 union select    fd.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fd.film_id) as short_identifier,    fd.name_id as legacy_name_id,    ltrim(rtrim(case        when n.name_type = 1 then            n.fname + substring(' ', sign(datalength(n.fname)), 1) +            n.mname + substring(' ', sign(datalength(n.mname)), 1) +            n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix        when n.name_type = 2 then n.lname        when n.name_type = 3 then n.lname    end)) as creator_display_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(workcreatortype):item:name(director)''Director''' as creator_type,    case        when n.name_type = 1 then            'urn:cspace:cinefiles.cspace.berkeley.edu:personauthorities:name(person):item:name(person'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.fname + substring(' ', sign(datalength(n.fname)), 1)                + n.mname + substring(' ', sign(datalength(n.mname)), 1)                + n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix))            + ''''        when n.name_type = 2 then            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(committee):item:name(committee'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.lname)) + ''''        when n.name_type = 3 then            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(corporation):item:name(organization'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.lname)) + ''''    end as creator_refname from    filmdirectors fd,    names n where fd.name_id *= n.name_id and fd.film_id > 1 and fd.name_id > 1 order by legacy_film_id, creator_display_name;
Notes
Talend/ETL
MERGE: Work Publishers (Film Production Companies) - SQL: Lam, Talend: Lam
Publishers = Production Companies
Field mapping
SQL
select distinct    fp.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fp.film_id) as short_identifier,    fp.name_id as legacy_name_id,    ltrim(rtrim(case        when n.name_type = 1 then            n.fname + substring(' ', sign(datalength(n.fname)), 1) +            n.mname + substring(' ', sign(datalength(n.mname)), 1) +            n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix        when n.name_type = 2 then n.lname        when n.name_type = 3 then n.lname    end)) as publisher_display_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(workpublishertype):item:name(productioncompany)''Production Company''' as publisher_type,    case        when n.name_type = 1 then            'urn:cspace:cinefiles.cspace.berkeley.edu:personauthorities:name(person):item:name(person'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.fname + substring(' ', sign(datalength(n.fname)), 1)                + n.mname + substring(' ', sign(datalength(n.mname)), 1)                + n.lname + substring(' ', sign(datalength(n.gen_suffix)), 1) + n.gen_suffix))            + ''''        when n.name_type = 2 then            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(committee):item:name(committee'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.lname)) + ''''        when n.name_type = 3 then            'urn:cspace:cinefiles.cspace.berkeley.edu:orgauthorities:name(corporation):item:name(organization'            + convert(varchar(10), n.name_id) + ')'''            + ltrim(rtrim(n.lname)) + ''''    end as publisher_refname from    filmprodcos fp,    names n where fp.name_id *= n.name_id and fp.film_id > 1 and fp.name_id > 1 order by legacy_film_id, publisher_display_name;
Notes
Talend/ETL
MERGE: Work Dates (Film Years) - SQL: Lam, Talend: Lam
Field mapping
SQL
select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    f.year as display_date,    convert(varchar(4), f.year) + '-01-01T00:00:00Z' as earliest_scalar,    f.year as earliest_year,    1 as earliest_month,    1 as earliest_day,    convert(varchar(4), f.year) + '-12-31T23:59:59Z' as latest_scalar,    f.year as latest_year,    12 as latest_month,    31 as latest_day from    films f where f.film_id > 1 and f.year is not null union select    fy.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fy.film_id) as short_identifier,    fy.year as display_date,    convert(varchar(4), fy.year) + '-01-01T00:00:00Z' as earliest_scalar,    fy.year as earliest_year,    1 as earliest_month,    1 as earliest_day,    convert(varchar(4), fy.year) + '-12-31T23:59:59Z' as latest_scalar,    fy.year as latest_year,    12 as latest_month,    31 as latest_day from    filmyears fy where fy.film_id > 1 and fy.year is not null order by legacy_film_id, display_date;
Notes
Talend/ETL
MERGE: Work Languages (Film Languages) - SQL: Lam, Talend: Lam
Field mapping
SQL
select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    f.lang_id as legacy_lang_id,    l.lang_tla as language_code,    ltrim(rtrim(l.lang)) as language_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(languages):item:name(' +        l.lang_tla + ')''' + ltrim(rtrim(l.lang)) + '''' as language_refname from    films f,    languages l where f.lang_id = l.lang_id and f.film_id > 1 and l.lang_id > 1 union select    fl.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fl.film_id) as short_identifier,    fl.lang_id as legacy_lang_id,    l.lang_tla as language_code,    ltrim(rtrim(l.lang)) as language_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(languages):item:name(' +        l.lang_tla + ')''' + ltrim(rtrim(l.lang)) + '''' as language_refname from    filmlangs fl,    languages l where fl.lang_id = l.lang_id and fl.film_id > 1 and l.lang_id > 1 order by legacy_film_id, language_name;
Notes
Talend/ETL
MERGE: Work Countries (Film Countries) - SQL: Lam, Talend: Lam
Field mapping
SQL
select    f.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), f.film_id) as short_identifier,    f.cntry_id as legacy_country_id,    c.cntry_code as country_code,    ltrim(rtrim(c.country)) as country_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(country):item:name(' +        ltrim(rtrim(c.cntry_code)) + ')''' + ltrim(rtrim(c.country)) + '''' as country_refname from    films f,    countries c where f.cntry_id = c.cntry_id and f.film_id > 1 and c.cntry_id > 1 union select    fc.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fc.film_id) as short_identifier,    fc.cntry_id as legacy_country_id,    c.cntry_code as country_code,     ltrim(rtrim(c.country)) as country_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:vocabularies:name(country):item:name(' +        ltrim(rtrim(c.cntry_code)) + ')''' + ltrim(rtrim(c.country)) + '''' as country_refname from    filmcountries fc,    countries c where fc.cntry_id = c.cntry_id and fc.film_id > 1 and c.cntry_id > 1 order by legacy_film_id, country_name;
Notes
Talend/ETL
MERGE: Work Genres (Film Genres) - SQL: Lam, Talend: Lam
Repeating: concept-genre vocabulary
Field mapping
SQL
select    fg.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fg.film_id) as short_identifier,    fg.genre_id as legacy_genre_id,    ltrim(rtrim(g.genre)) as genre_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:conceptauthorities:name(genre):item:name(genre'        + convert(varchar(4), fg.genre_id) + ')'''        + ltrim(rtrim(g.genre))        + '''' as genre_refname from    filmgenres fg,    genres g where fg.genre_id = g.genre_id and fg.film_id > 1 and g.genre_id > 1 order by legacy_film_id, genre_name;
Notes
Talend/ETL
MERGE: Work Subjects (Film Subjects) - SQL: Lam, Talend: Lam
Repeating: concept-subject (default) vocabulary
Field mapping
SQL
select    fs.film_id as legacy_film_id,    'pfafilm' + convert(varchar(10), fs.film_id) as short_identifier,    fs.subj_id as legacy_subj_id,    ltrim(rtrim(s.subj)) as subject_name,    'urn:cspace:cinefiles.cspace.berkeley.edu:conceptauthorities:name(subject):item:name(subject'        + convert(varchar(6), fs.subj_id) + ')'''        + ltrim(rtrim(s.subj))        + '''' as subj_refname from    filmsubjs fs,    subjects s where fs.subj_id = s.subj_id and fs.film_id > 1 and s.subj_id > 1 order by legacy_film_id, subject_name;
Notes
Talend/ETL
Cataloging - documents
Collection Objects - documents - SQL: Chris, Talend: Chris
Field mapping
SQL
SELECT d.doc_id, title, rtrim(prefix) prefix, case when (prefix like '%''' or prefix like '%-' or prefix is null) then prefix||title else prefix||' '||title end as docDisplayName, type_id, pages numberObjects, pg_info pagination, case when d.src_id <> 1 then d.src_id end as srcShortId, case when d.src_id <> 1 then n.name_type end as srcNameType, n.fname + substring( ' ', sign( datalength( n.fname ) ), 1 ) + n.mname + substring( ' ', sign( datalength( n.mname ) ), 1 ) + n.lname + substring( ' ', sign( datalength( n.gen_suffix ) ), 1 ) + n.gen_suffix as srcDisplayName, date_string, early_date, late_date, case when (early_date>2415021 and early_date<2456602) then dateadd(dd,(early_date - 2415021), convert(date, '1900-01-01', 103)) end as begin_date_scalar, case when (early_date>2415021 and early_date<2456602) then datepart(year, dateadd(dd, (early_date - 2415021), convert(datetime, '1900-01-01', 103))) end as begin_date_year, case when (early_date>2415021 and early_date<2456602) then datepart(month, dateadd(dd, (early_date - 2415021), convert(datetime, '1900-01-01', 103))) end as begin_date_month, case when (early_date>2415021 and early_date<2456602) then datepart(day, dateadd(dd, (early_date - 2415021), convert(datetime, '1900-01-01', 103))) end as begin_date_day, case when (late_date>2415021 and late_date<2456602) then dateadd(dd,(late_date - 2415021), convert(date, '1900-01-01', 103)) end as end_date_scalar, case when (late_date>2415021 and late_date<2456602 and early_date<>late_date) then datepart(year, dateadd(dd,(late_date - 2415021), convert(datetime, '1900-01-01', 103))) end as end_date_year, case when (late_date>2415021 and late_date<2456602 and early_date<>late_date) then datepart(month, dateadd(dd,(late_date - 2415021), convert(datetime, '1900-01-01', 103))) end as end_date_month, case when (late_date>2415021 and late_date<2456602 and early_date<>late_date) then datepart(day, dateadd(dd,(late_date - 2415021), convert(datetime, '1900-01-01', 103))) end as end_date_day, sort_date, d.code, d.note briefDescription, case when cast_cr=1 then 'true' else 'false' end as cast_cr, case when tech_cr=1 then 'true' else 'false' end as tech_cr, case when bx_info=1 then 'true' else 'false' end as bx_info, case when filmog=1 then 'true' else 'false' end as filmog, case when dist_co=1 then 'true' else 'false' end as dist_co, case when prod_co=1 then 'true' else 'false' end as prod_co, case when cost=1 then 'true' else 'false' end as cost, case when illust=1 then 'true' else 'false' end as illust, case when biblio=1 then 'true' else 'false' end as biblio, d.entered, d.entered_by, d.modified, d.modified_by, u.src_id urlsource, n2.name_type urlnametype, u.docurl, n2.lname urlsourcename FROM docs d left outer join names n on n.name_id=d.src_id left outer join docurls u on (d.doc_id=u.doc_id and u.note is null and u.src_id<>486) left outer join names n2 on n2.name_id=u.src_id where d.doc_id<>1 order by d.doc_id
Notes
This job currently excludes records with bad dates. Michael is fixing these but need to confirm. Also, the Talend job should be extended to reuse the same CSIDs to facilitate Glen's work on Media Handling (i.e., it will be good if the document CSIDs persist though new records entered in CineFiles will need new CSIDs of course).
Talend/ETL
MERGE: Document Subjects - SQL: Chris, Talend: Chris
Repeating: Concept-subject vocabulary
Field mapping
SQL
SELECT doc_id, d.subj_id, s.subj FROM docsubjs d inner join subjects s on s.subj_id = d.subj_id where d.subj_id<>1 order by doc_id
Notes
Talend/ETL
MERGE: Name Subjects - SQL: Chris, Talend: Chris
Repeating: person-person vocabulary and org-corporate vocabulary
Field mapping
SQL
SELECT nd.doc_id, nd.namesubj_id, case when n.name_type=2 then 'organization' when n.name_type=1 then 'person' when n.name_type=3 then 'committee' end as subjectVocab, ltrim(n.fname + substring( ' ', sign( datalength( n.fname ) ), 1 ) + mname + substring( ' ', sign( datalength( n.mname ) ), 1 ) + lname + substring( ' ', sign( datalength( n.gen_suffix ) ), 1 ) + n.gen_suffix) as displayName, case when n.name_type=1 then 'personauthorities' else 'orgauthorities' end as authtype FROM namedocs nd left outer join names n on n.name_id=nd.namesubj_id where nd.namesubj_id<>1 and nd.doc_id<>1 order by nd.doc_id
Notes
Talend/ETL
MERGE: Film Subjects - SQL: Chris, Talend: Chris
Repeating: works-film vocabulary
Field mapping
SQL
SELECT fd.doc_id, fd.film_id, f.title, 'urn:cspace:cinefiles.cspace.berkeley.edu:workauthorities:name(work):item:name(' + convert(varchar(10), f.film_id) + ')''' + substring( substring( f.prefix + ' ', 1, datalength( f.prefix + ' ' ) - sign( patindex( '%[''-]', f.prefix ))), sign( patindex( '%[^^I ]%', f.prefix )), 10 ) + f.title + '''' as refname FROM filmdocs fd left outer join films f on fd.film_id=f.film_id where fd.film_id <> 1 and fd.doc_id <> 1 order by fd.doc_id
Notes
Talend/ETL
MERGE: Document Languages - SQL: Chris, Talend: Chris
Repeating: Language values (is this a URN-formatted name?)
Field mapping
SQL
SELECT dl.doc_id, dl.lang_id, l.lang_tla, lang FROM doclangs dl left outer join languages l on dl.lang_id=l.lang_id where dl.lang_id <> 1 and l.lang_id <> 1 order by dl.doc_id
Notes
Talend/ETL
MERGE: Document authors - SQL: Chris, Talend: Chris
Repeating: From person though there are 16 docauthor records in Sybase that are Corporate Names.
Field mapping
SQL
SELECT doc_id, a.name_id authorShortId, case when n.name_type=2 then 'organization' when n.name_type=1 then 'person' end as authorVocab, n.fname + substring( ' ', sign( datalength( n.fname ) ), 1 ) + mname + substring( ' ', sign( datalength( n.mname ) ), 1 ) + lname + substring( ' ', sign( datalength( n.gen_suffix ) ), 1 ) + n.gen_suffix as authDisplayName FROM docauthors a left outer join names n on n.name_id=a.name_id where a.name_id<>1 order by doc_id
Notes
Talend/ETL
Procedures
Media Handling - document pages - SQL: TBD, Talend: TBD
Field mapping
SQL
SELECT doc_id, convert( varchar(15), doc_id ) as shortIdentifier, pages, page, convert( varchar(5), page ) as pageNumber, convert( char(5), datepart(year, entered)) + substring( convert( varchar(50), entered, 7), 1, charindex( ',', convert( varchar(50), entered, 7))-1) as displayDate, datepart( year, entered ) as earliestYear, datepart( month, entered ) as earliestMonth, datepart( day, entered ) as earliestDay, entered as earliestScalarDate, dateadd(day, 1, entered) as latestScalarDate, entered, modified, ISNULL(entered_by, '') as enteredBy, ISNULL(modified_by, ISNULL(entered_by, '')) as modifiedBy, src_id, convert( varchar(15), src_id) as publisherIdentifier FROM docs d, pages p WHERE d.pages >= p.page AND doc_id > 1
Notes
The query includes a join on a synthetic table that only contains a list of page numbers. By joining on the 'pages' table I get a row for each page up to the number of "pages' indicated in the docs table. As usual, doc_id 1 is a dummy value.
Talend/ETL
Media loading
Start by getting the CSID and short title (documentID + page number) from the media_common table.
(code}
select m.title,
h.name as CSID,
regexp_matches( m.title, '^\d{1,5}.p\d+$') as mediatitle
from hierarchy h, media_common m
where h.id = m.id
and (m.blobcsid is null or length(m.blobcsid) = 0)
Save the output into a file (e.g. "media.list"). The first field is used to sort numerically and can then be removed. The sorting step is useful for uploading image files in smallish blocks with contiguous document IDs. Nowrun the upload script with the "media.list"
BASEURL="https//cinefiles.cspace.berkeley.edu/cspace-services/media"
TYPE="Content-type: application/xml"
USER="admin@cinefiles.cspace.berkeley.edu:xxxxxxx"
IMGDIR="/my/image/file/location"
OUTFILE="curl.out"
while read F
do
CSID=${F%% *}
IDENT=${F##* }
# Select the best version of the file
FILE="${IDENT}.600col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.450col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.400col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.300col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.150col.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.600gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.450gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.400gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.300gray.tif"
[ -f $IMGDIR/$FILE ] || FILE="${IDENT}.400dpi.tif"
# Did we find one?
[ -f $IMGDIR/$FILE ] || continue
URL="${BASEURL}/$CSID?blobUri=file://$IMGDIR/$FILE"
LOGFILE="$OUTFILE.$IDENT"
curl -X POST -i -u "$USER" -H "$TYPE" $URL -o $LOGFILE
mv $IMGDIR/$FILE $IMGDIR/done
done < media.list
The media record title should reflect the original file name of the attached blob, but when blobs are attached to an existing media record in a batch, the title does not get updated. After importing blobs, the media records can be updated in the database to set the title correctly.
BEGIN;
update media_common set title = mbt.filename
from utils.media_blob_titles mbt
where media_common.blobcsid = mbt.blobcsid
and media_common.title similar to '0123456789.p0123456789'
and mbt.filename like media_common.title||'.%';
COMMIT;
Note that it is a good idea to wrap the update in a transaction. I included both the BEGIN and COMMIT statements here, but in practice the results should be reviewed before issuing the COMMIT.
Relationships: Media Handling to Collection Objects
Lam is running the Talend job originally developed by Yuteh for PAHMA and UCJEPS