PAHMA data migration ETL work

This page contains information about data ETL work for PAHMA's deployment of CollectionSpace.

Current Plan

Please see our current list of questions and meeting notes.

Total fields in TMS and TMSThesaurus: 3242 (3111 + 131)
Fields to be mapped to CSpace from TMS and TMSThesaurus: 763 (699 + 64) or less

Plan is to do data migration in three phases:

Earlier than main migration; can be done several weeks in advance of "go live" date
Main migration; must be done in very limited time window (while TMS system is shut down?)
Later migration; can be done anytime after "go live" date
Michael T. Black has divided all TMS fields to be mapped into three groups, determining when they will be migrated:

Types of data targeted for earlier migration:
- Persons/Organizations
- Contact information
- Controlled lists
- Bibliographic references
- Storage locations
- Place authority
- Culture authority
- Ethnographic use code authority
Types of data targeted for main migration:
- Cataloging info
- Sub-objects (Components)
- Relationships (object-object)
- Groups
- Dimensions
- Containers
- Media handling
- Acquisitions (Accessions)
- Object exit (Deaccessions)
- Intake
- Loans
- NAGPRA information
- all cross-references to data in earlier migration
Types of data targeted for later migration:
- Shipments
- Exhibits
- Versioning/audit info
- Insurance and valuation
- Object use
- Conservation

Status update - 10-August-2011

Michael and Susan have already done a huge amount of work. Susan's Kettle files include notes, but Michael is the person who understands the TMS data model.
Most existing Kettle jobs are pretty good. The exception is object cataloging data, which will need some significant updating.
All jobs need to be studied to check assumptions and look for limits that were put in place to create sample data sets.
Most input and output data files are in the files that Susan provided. Some might be misplaced and need to be recreated.
Rather than rewrite Kettle jobs in Talend, we will probably take Kettle CSV outputs and parse them through Talend into the appropriate XML format.
Michael can create views in MSSQL as needed.
If we can figure out how to create CSIDs (UUIDs) in MSSQL, that might help data migration.
Some of the files from Susan's work are no longer needed. E.g., there are jobs that are now becoming simple controlled lists.

ETL Jobs

Job	Description	Inputs	Outputs	Notes
tms_obj_v0.1_all.ktr	Object records first job	11 TMS queries	text	CSpace 1.3 schema. For Object records run this job first. See merge_object_data.ktr for notes on sequence for running object transformations, starting with tms_obj_v0.1_all (collection site is a job with 3 transformations) reference and inscription are done as updates after objects loaded
tms_obj_collsite1.ktr	Object records -- collection site job#1	3 TMS queries (object site, provenance from object, textentries)	matching text file & non-mtching tex tfile	Extract sitename/sitenote/provenance and merge with textentry places
tms_obj_collsite2.ktr	Object records -- collection site job#2	3 Thesaurus files (tmsthesaurus..terms, tms..thesxrefs, tmsthesaurus..termmaster)	text file	Take records failed to match with textentry places (last step, tms_obj_collsite1.ktr) and merge them with thesaurus lookup (provenance & card header geog info)
tms_obj_collsite3.ktr	Object records -- collection site job#3	2 text file	text file	Merge the last two steps to create a single collection site file with objectid, sitename & sitenote
coll_site_job.kjb	Sequences "collection site" job runs	3 Kettle jobs for "collection site"	resulting file from tms_obj_collsite3.ktr	Enforce the 3 jobs for object "collection site" to be run in sequential manner
tms_obj_collector.ktr	Object records -- collector	4 TMS queries (object collector, collection date, field collection number, date)	text file "tms_obj_collector"	Extract/merge to create file w/ objectid, collectorRefName, collectionDate, fieldnum, collectionnote
tms_obj_production.ktr	Create object records' production info	4 TMS queries (production person/org, date, reason-tourist, place-minting) and TSM Thesaurus of culture term	text file "tms_obj_prod"	Merge object production person/org RefName/role with production date (made/photo/pub/mint/ruler), culture, reason and minting place
merge_object_data.ktr	Creates final Object records	4 text files (main object, collection site, collector, production)	text "tms_obj"	run this after creating: 1. main obj data 2. collectors (using id output from 1 as input) 3. collection sites (using job that calls three transformations) (usig ids output from 1 as input to tms_obj_collsite1) 4. tms_obj_production (using id output from 1 as input) BE SURE TO CHANGE OBJECT INPUT AND OUTPUT EXTENSION n=140K (4/1/2011)
reference.ktr	Creates object's reference string	3 TMS queries (referencemaster, refobjxref, object) and a CSID text file	text "tms_obj_ref"	Creating "ref" by concatenating title/subtitle/placepublished/copyright... in referencemaster & refobjxref DBs, then merge/join w/ "ref" (created by normalizing pubreference & bibliography) from "object" DB. At last step, merge in CSID.(link key objectid=tmsID). n=3598 (4/1/2011)
tms_inscription.ktr	Creates objects inscription information	3 TMS queries (TextEntries, Constituents, object) and a CSID text file	text "tms_inscription"	Merge object's (1st/single) TextTypeID/TextEntry (i.e.Inscription Content)/Inscriber/MTB_dated/MTB_Language/MTB_Interpretation/MTB_...) with CSID (link key objectid=tmsID). n=1052 (4/1/2011)
media.ktr	Creates Media Handling records		text	CSpace 1.6 schema.
acq_xml.ktr	Creates Acquisition records (PAHMA accessions)	8 TMS queries, 1 txt file of accession methods from another job	XML file (/home/sstone/cspace_general/import/acq_import.xml), CSID output (/tmp/acq_csids)	Note: Collector refname should be collecting event here. Note: This outputs 10,000 acquisition xml records per file without adding top-level element.
loan_in.ktr	Creates Loan In records		text	CSpace 1.3 schema
loan_out.ktr	Creates Loan Out records		text	CSpace 1.3 schema
loc_move.ktr	Creates Location and Movement Control records		text	CSpace 1.3 schema
object_exit.ktr	Creates Object Exit records		text	CSpace 1.3 schema

obsolete jobs
person_authority.ktr N.B. please see updated info	Creates Person authority records	3 TMS queries (person, birth, death) N.B. these are obsolete	person.txt (7012 records in file dated Feb 11, 2011)	Mapping is pretty good. Legacy data has many duplicates with alternative spellings. MTB has created a field that identifies the preferred name for each record. We should take advantage of that to help create synonym relationships. Job can select distinct Salutation and NameTitle values too. Note: Michael will move birth and death dates from condates into constituents (or will have these dates manually re-entered post-migration). Note: Don't really want to remove these delete ones (filter out records marked delete). They should be reviewed.
org_authority.ktr N.B. please see updated info	Creates Organization authority records	2 TMS queries (org, foundation date) N.B. these are obsolete	org.txt (1577 records in file dated Feb 14, 2011)	Mapping is pretty good. Legacy data has many duplicates with alternative spellings. MTB has created a field that identifies the preferred name for each record. We should take advantage of that to help create synonym relationships. Can produce Org Functions values too. Note: No End Date Info for Organizations in ConDates. This info will be moved into constituents. Note: need remarks field in org authority

CollectionSpace Deployments

PAHMA data migration ETL work

Analytics

Current Plan

Status update - 10-August-2011

ETL Jobs