PAHMA data migration ETL work
This page contains information about data ETL work for PAHMA's deployment of CollectionSpace.
Current Plan
Please see our current list of questions and meeting notes.
Total fields in TMS and TMSThesaurus: 3242 (3111 + 131)
Fields to be mapped to CSpace from TMS and TMSThesaurus: 763 (699 + 64) or less
Plan is to do data migration in three phases:
- Earlier than main migration; can be done several weeks in advance of "go live" date
- Main migration; must be done in very limited time window (while TMS system is shut down?)
- Later migration; can be done anytime after "go live" date
Michael T. Black has divided all TMS fields to be mapped into three groups, determining when they will be migrated:
- Types of data targeted for earlier migration:
- Persons/Organizations
- Contact information
- Controlled lists
- Bibliographic references
- Storage locations
- Place authority
- Culture authority
- Ethnographic use code authority
- Types of data targeted for main migration:
- Cataloging info
- Sub-objects (Components)
- Relationships (object-object)
- Groups
- Dimensions
- Containers
- Media handling
- Acquisitions (Accessions)
- Object exit (Deaccessions)
- Intake
- Loans
- NAGPRA information
- all cross-references to data in earlier migration
- Types of data targeted for later migration:
- Shipments
- Exhibits
- Versioning/audit info
- Insurance and valuation
- Object use
- Conservation
Status update - 10-August-2011
- Michael and Susan have already done a huge amount of work. Susan's Kettle files include notes, but Michael is the person who understands the TMS data model.Â
- Most existing Kettle jobs are pretty good. The exception is object cataloging data, which will need some significant updating.
- All jobs need to be studied to check assumptions and look for limits that were put in place to create sample data sets.
- Most input and output data files are in the files that Susan provided. Some might be misplaced and need to be recreated.
- Rather than rewrite Kettle jobs in Talend, we will probably take Kettle CSV outputs and parse them through Talend into the appropriate XML format.
- Michael can create views in MSSQL as needed.
- If we can figure out how to create CSIDs (UUIDs) in MSSQL, that might help data migration.
- Some of the files from Susan's work are no longer needed. E.g., there are jobs that are now becoming simple controlled lists.
ETL Jobs
Job |
Description |
Inputs |
Outputs |
Notes |
---|---|---|---|---|
tms_obj_v0.1_all.ktr |
Object records first job |
11 TMS queries |
text |
CSpace 1.3 schema. For Object records run this job first. See merge_object_data.ktr for notes on sequence for running object transformations, starting with tms_obj_v0.1_all (collection site is a job with 3 transformations) |
tms_obj_collsite1.ktr |
Object records -- collection site job#1 |
3 TMS queries (object site, provenance from object, textentries) |
matching text file & non-mtching tex tfile |
Extract sitename/sitenote/provenance and merge with textentry places |
tms_obj_collsite2.ktr |
Object records -- collection site job#2 |
3 Thesaurus files (tmsthesaurus..terms, tms..thesxrefs, tmsthesaurus..termmaster) |
text file |
Take records failed to match with textentry places (last step, tms_obj_collsite1.ktr) and merge them with thesaurus lookup (provenance & card header geog info) |
tms_obj_collsite3.ktr |
Object records -- collection site job#3 |
2 text file |
text file |
Merge the last two steps to create a single collection site file with objectid, sitename & sitenote |
coll_site_job.kjb |
Sequences "collection site" job runs |
3 Kettle jobs for "collection site" |
resulting file from tms_obj_collsite3.ktr |
Enforce the 3 jobs for object "collection site" to be run in sequential manner |
tms_obj_collector.ktr |
Object records -- collector |
4 TMS queries (object collector, collection date, field collection number, date) |
text file "tms_obj_collector" |
Extract/merge to create file w/ objectid, collectorRefName, collectionDate, fieldnum, collectionnote |
tms_obj_production.ktr |
Create object records' production info |
4 TMS queries (production person/org, date, reason-tourist, place-minting) and TSM Thesaurus of culture term |
text file |
Merge object production person/org RefName/role with production date (made/photo/pub/mint/ruler), culture, reason and minting place |
merge_object_data.ktr |
Creates final Object records |
4 text files (main object, collection site, collector, production) |
text |
run this after creating: |
reference.ktr |
Creates object's reference string |
3 TMS queries (referencemaster, refobjxref, object) and a CSID text file |
text |
Creating "ref" by concatenating title/subtitle/placepublished/copyright... in referencemaster & refobjxref DBs, then merge/join w/ "ref" (created by normalizing pubreference & bibliography) from "object" DB. At last step, merge in CSID.(link key objectid=tmsID). |
tms_inscription.ktr |
Creates objects inscription information |
3 TMS queries (TextEntries, Constituents, object) and a CSID text file |
text |
Merge object's (1st/single) TextTypeID/TextEntry (i.e.Inscription Content)/Inscriber/MTB_dated/MTB_Language/MTB_Interpretation/MTB_...) with CSID (link key objectid=tmsID). |
media.ktr |
Creates Media Handling records |
 |
text |
CSpace 1.6 schema. |
acq_xml.ktr |
Creates Acquisition records (PAHMA accessions) |
8 TMS queries, 1 txt file of accession methods from another job |
XML file (/home/sstone/cspace_general/import/acq_import.xml), CSID output (/tmp/acq_csids) |
Note: Collector refname should be collecting event here. |
loan_in.ktr |
Creates Loan In records |
 |
text |
CSpace 1.3 schema |
loan_out.ktr |
Creates Loan Out records |
 |
text |
CSpace 1.3 schema |
loc_move.ktr |
Creates Location and Movement Control records |
 |
text |
CSpace 1.3 schema |
object_exit.ktr |
Creates Object Exit records |
 |
text |
CSpace 1.3 schema |
 |
 |
 |
 |
 |
obsolete jobs |
 |
 |
 |
 |
person_authority.ktr |
Creates Person authority records |
3 TMS queries (person, birth, death) |
person.txt (7012 records in file dated Feb 11, 2011) |
Mapping is pretty good. Legacy data has many duplicates with alternative spellings. MTB has created a field that identifies the preferred name for each record. We should take advantage of that to help create synonym relationships. Job can select distinct Salutation and NameTitle values too. |
org_authority.ktr |
Creates Organization authority records |
2 TMS queries (org, foundation date) |
org.txt (1577 records in file dated Feb 14, 2011) |
Mapping is pretty good. Legacy data has many duplicates with alternative spellings. MTB has created a field that identifies the preferred name for each record. We should take advantage of that to help create synonym relationships. Can produce Org Functions values too. |