This document pulls together information about ETL work for CollectionSpace deployments. We are taking a team-based approach to get more of us up to speed.

Topics

Installing Talend Open Studio

Starting up the first time

Connecting to a database

A picture is worth a thousand words.

Use read-only accounts in the database

Common workflows, best practices, and tips

Creating an XML schema to seed the XML tree in the tAdvancedFileOutputXML component

<?xml version="1.0" encoding="UTF-8"?>
<imports>
  <import service="Persons" type="Person" CSID="4fbab950-8e36-4977-a46f-6491f799aa03">
    <schema xmlns:persons_common="http://collectionspace.org/services/person" name="persons_common">
      <persons_common:foreName>Your exported data from Nuxeo starts here</persons_common:foreName>
      ... more XML data elements ...
    </schema>
  </import>
</imports>

Character and data issues

Note: It seems to be very hard to represent the actual text for the XML versions of ampersand and other characters that need to be escaped in XML. 

Problems

TOS crashes when retrieving large schema

See Glen's comment below about success with the latest release.

Chris: Encounters this often when trying to import Sybase schema into TOS metadata.  Restarted machine and launched TOS only.  Created a new project and new database connection to smaschprod.  Tried to import all tables in smaschprod.  TOS crashed and pointed to /Applications/TOS-All-r60995-V4.2.1/workspace/.metadata/.log.  File includes the following:

!ENTRY org.talend.platform.logging 2 0 2011-08-08 13:20:39.754
!MESSAGE 2011-08-08 13:20:39,726 WARN  org.talend.core.model.metadata.DBConnectionFillerImpl  - S0022: Invalid column name 'TABLE_CATALOG'.

I can import schema on individual files.  So, I tried importing only a large set of table schema (A-G tables).  That did work successfully.  However, when I tried to add more table schema, TOS hung after I selected "Retrieve Schema" by right-clicking on the database connection.  Mac logging reports available upon request!

TOS crashes

Lam continues to have problems on WinXP with frequent but intermittent crashes.  E.g., just opening projects.

XML schema as metadata in TOS repository

Chris can create XML schema in the TOS repository, but he can not yet use those in a TOS job.  Instead, for each job, you can import the XML tree in the advanced XML output component.

XML schema coming from Nuxeo might not have all elements

Best practice might be to create a dummy record with values in all fields, including repeatable fields and field groups.  For instance, when exporting a minimal record from a test PAHMA instance, the organization schema lacked some of the nested structures. For example, output from Nuxeo might include:

<organizations_common:groups/>

instead of:

<organizations_common:groups>
     <organizations_common:group>Japanese</organizations_common:group>
</organizations_common:groups>

TOS tAdvancedFileOutputXML component escapes ampersands automatically

The tAdvancedFileOutputXML component wants to convert ampersands to the escaped XML version (ampersand amp semicolon).  This is good except that because of a bug in the Imports Service, ampersand needs to be represented differently (ampersand amp semicolon amp semicolon).  Treat accordingly.

TOS can not loop CSpace XML schema?

When trying to import denormalized data (using the loop element and group element on the tAdvancedFileOutputXML component), I (Chris) am not finding a way to get it to work.  Inevitably the "import" element is only printed out one time.  In other words, if I am expecting to create records, I would want each one to have its own "import" wrapper within the overall "imports" wrapper.  Within each "import" wrapper, I would expect one "schema" wrapper, in this simple case because I am only using a simple non-extended schema.  Instead, I have one "import" wrapper enclosing ten otherwise nicely formed "schema" wrappers with the correct format.  The looping to create repeating or nested entities such as organization names or functions seems to work correctly, but I am only getting one "import" wrapper.  I'm not sure what is going on and have tried many permutations. Tested in TOS 5; behaves identically.

When import XML tree with more than one "schema" wrapper, only one "schema" wrapper kept

If you have an XML file with more than one "schema" wrapper (e.g., including local or domain extensions), on importing into the tAdvancedFileOutputXML component, only one "schema" element is retained.  It looks like the fields from the other schema are kept though they are all combined under one "schema" wrapper.  Apparently, any elements with non-unique names are dropped.  Yuteh reports that the same thing happens with structured date sub-elements.  If you have multiple structured dates in your schema, only the first set of sub-elements are recognized. Tested in TOS 5; behaves identically.  Finding: The problem is that the core element name is not unique at the same level of the XML tree.  You can rename those (e.g., schema1, schema2, and so on) and then change the name back after the file has been created.  It looks like having multiple structured dates is OK though non-unique names are repeated in the schema, perhaps because the non-unique names do not occur at the same level of the XML tree within the same node.  Yuteh had reported some problems with actual data mapping however.

Empty records created mysteriously

September 16: While loading 9 PAHMA objects, somehow records were created with no data even though only minor changes had been done in the Talend job.  Yuteh and Chris were trying to handle ampersands, and checking whether the slightly dirty version of hex return-newline would work (where we had amp amp sharp x0A).  Nine object records were created with CSIDs only.  Still checking this.

Resources