Imports Service user instructions

Draft with notes and next steps

Note: A process and documentation for initializing an authority are still required. Currently, Glen creates a record in the CSpace UI, which initializes the default authority and creates a CSID. That CSID can then be used in the steps below.

1. Create at least one sample record for each of your record types. Make sure you have data in any repeating fields or field groups.

2. Export those records using Nuxeo's web-based administrative interface, following Laramie's step-by-step instructions here:

<http://wiki.collectionspace.org/display/collectionspace/Imports+Service+Home#ImportsServiceHome-Howtodeterminethecorrectvaluestoputintoan%22%2Fimports%2Fimport%2Fschema%22element%3A>

Notes:

  • The Nuxeo web-based administrative interface comes with a CSpace installation and runs on port 8080.
  • We need to check whether a complex catalog record is returned correctly, e.g., one with nested repeating elements, multiple schemas, and ampersands or other character entities. Note: It looks like you should export a record that has data in all fields, and especially in nested (repeating) structures; otherwise the XML file might not contain the XML elements for the nested structure. See the Talend-ETL work for UCB deployments notes as well.
  • We need to confirm how to change the Nuxeo administrator password, both in Nuxeo and in the services configuration.
  • This Nuxeo-based export file can also be used to seed the XML tree for outputs created via Talend Open Studio.

3. Use the Ruby script near the end of that document to convert the records you exported in step 2 to a format that the import service can ingest. You can also do this manually, but a script makes it easier. (Glen created an 'ed' script for the same purpose; you can ask him for that, if you wish.)

Note: We should rewrite this in Groovy or something more standard. In the meantime, the Ruby script should be in Subversion; multiple copies are sitting around. Aron will create a location for it in a scripts directory.

If you use the Ruby script, you'll need to do two things:

a. A one-time task: install Ruby on your system, if it isn't already present. Examples and links on how to do so are here:

<http://wiki.collectionspace.org/display/collectionspace/Imports+Service+Home#ImportsServiceHome-ImportingrecordsexportedfromaCollectionSpaceinstance>

b. Edit the three variables near the top of the script to reflect the specifics of the record type you're importing. Step 2 should give you the information you need to fill in these values:

servicename = "Persons"

recordtype = "Person"

schemas = [ "persons_common" ]

For records where pertinent data is stored in more than one schema (e.g., a common schema like collectionobjects_common plus one or more extension schemas like collectionobjects_naturalhistory), the value of the 'schemas' variable might look like this:

schemas = [ "collectionobjects_common", "collectionobjects_naturalhistory" ]

(Scripting contributions to make these variables command-line parameters are welcome :-)
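As one possible starting point for such a contribution, here is a minimal sketch using Ruby's standard OptionParser. The function name parse_import_options, the option names, and the script name in the usage comment are hypothetical and are not part of the existing Ruby script; the defaults simply mirror the Person example above.

```ruby
require "optparse"

# Hypothetical helper: reads the three settings from command-line arguments
# instead of editing them at the top of the script.
def parse_import_options(argv)
  # Defaults mirror the Person example in these instructions.
  options = {
    servicename: "Persons",
    recordtype: "Person",
    schemas: ["persons_common"]
  }

  OptionParser.new do |opts|
    opts.banner = "Usage: convert_export.rb [options]" # hypothetical script name
    opts.on("--servicename NAME", "Service name, e.g. Persons") do |value|
      options[:servicename] = value
    end
    opts.on("--recordtype TYPE", "Record type, e.g. Person") do |value|
      options[:recordtype] = value
    end
    # The Array type splits a comma-separated value into a list of strings.
    opts.on("--schemas LIST", Array, "Comma-separated schema names") do |value|
      options[:schemas] = value
    end
  end.parse!(argv)

  options
end

# Example use inside the script (not executed here):
#   options     = parse_import_options(ARGV)
#   servicename = options[:servicename]
#   recordtype  = options[:recordtype]
#   schemas     = options[:schemas]
```

With something like this in place, a record type with extension schemas could be handled by passing --schemas collectionobjects_common,collectionobjects_naturalhistory rather than editing the script.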

4. If you need to generate your own CSIDs for authority terms, see Glen's comment at the bottom, which provides useful additions to that Ruby script.

5. Manually clean up any characters or character sequences in the data that may trip up the import service, either via search and replace or by using a 'sed' script, a BBEdit text factory, etc. From what I recall, I had to do the following:

    1. Unescaped special XML characters (&, <, >, ', ") need to be turned into XML entities. This should be done in the Groovy or Ruby script.
    2. XML entities need to be doubled, i.e., the leading ampersand of each existing entity is itself replaced with &amp;. This should be fixed in the Ruby (or Groovy) script.
    3. You might also look for dollar signs, which trigger macro interpolation. I didn't run across any in the UCJEPS Person records, so I don't know first-hand how you might munge them. This may only be an issue when the text has the form ${sometext}.
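The three cleanup items above could be scripted roughly as follows. This is a sketch only: the helper names (escape_xml, double_entities, macro_triggers) are hypothetical and not part of the existing Ruby script, and the reading of "doubled" as "escape the leading ampersand of an existing entity" is an assumption based on the note above. Which of the first two helpers applies to a given field depends on whether the exported text is already entity-escaped.

```ruby
# Hypothetical helpers; none of these exist in the current Ruby script.

XML_ESCAPES = {
  "&" => "&amp;",
  "<" => "&lt;",
  ">" => "&gt;",
  "'" => "&apos;",
  '"' => "&quot;"
}.freeze

# Item 1: turn raw special characters into XML entities.
def escape_xml(text)
  text.gsub(/[&<>'"]/) { |char| XML_ESCAPES[char] }
end

# Item 2 (assumed meaning of "doubled"): escape the leading ampersand of
# anything that already looks like an entity, so "&lt;" becomes "&amp;lt;".
# Ampersands that do not start an entity are left alone.
def double_entities(text)
  text.gsub(/&(?=(?:[A-Za-z]+|#\d+|#x[0-9A-Fa-f]+);)/, "&amp;")
end

# Item 3: list ${...} sequences so they can be reviewed by hand.
# Bare dollar signs (e.g. "$5") are not reported.
def macro_triggers(text)
  text.scan(/\$\{[^}]*\}/)
end
```

The third helper only reports candidates rather than rewriting them, since how to munge those sequences safely is still an open question.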

6. Perform the import (using curl, or by wrapping the call into the Ruby or Groovy script).
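If the import call is wrapped into the Ruby script rather than run via curl, it might look something like the sketch below, which uses Ruby's standard Net::HTTP library. Everything deployment-specific here is an assumption: the function names are hypothetical, and the URL, user, and password shown in the usage comment are placeholders that must be replaced with the values for your own services instance. The sketch also assumes the service accepts an HTTP POST of the XML file with basic authentication.

```ruby
require "net/http"
require "uri"

# Hypothetical helper: builds (but does not send) a POST request that
# carries the import XML. Assumes basic auth and an XML content type.
def build_import_request(url, user, password, xml_body)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri)
  request.basic_auth(user, password)
  request["Content-Type"] = "application/xml"
  request.body = xml_body
  [uri, request]
end

# Hypothetical helper: sends the request and returns the HTTP response.
def send_import(url, user, password, xml_body)
  uri, request = build_import_request(url, user, password, xml_body)
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
end

# Example use (placeholder values, not executed here):
#   response = send_import("http://localhost:8180/cspace-services/imports",
#                          "your-user", "your-password",
#                          File.read("import.xml"))
#   puts response.code
```

Separating request construction from sending makes it possible to inspect the request before anything is posted to a live system.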

Notes:

  • Currently, import performance seems to slow down with large record sets. Glen splits these into batches of 5k to 10k records.
  • The file system is also filling up.
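The batching mentioned in the notes above could be scripted along these lines. This is a sketch under assumptions: the function names are hypothetical, each record is assumed to be available as a string holding one complete record fragment, and the <imports> root element used as the wrapper is an assumption about the import file format that should be checked against the format your conversion script actually produces.

```ruby
# Hypothetical helper: split an array of records into batches of at most
# batch_size items each. The default of 5000 follows the note above.
def batch_records(records, batch_size = 5000)
  records.each_slice(batch_size).to_a
end

# Hypothetical helper: wrap each batch of record fragments in a root
# element, returning one XML document string per batch.
# The <imports> wrapper is an assumption about the import file format.
def batch_documents(records, batch_size = 5000)
  batch_records(records, batch_size).map do |batch|
    "<imports>\n#{batch.join("\n")}\n</imports>\n"
  end
end

# Example use (not executed here): write each batch to its own file.
#   batch_documents(record_fragments, 5000).each_with_index do |doc, index|
#     File.write(format("import_batch_%03d.xml", index + 1), doc)
#   end
```

Each resulting file can then be imported separately, which keeps individual imports within the 5k to 10k range.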