How to Import Data

This document, while broadly accurate, needs to be updated to match current documentation standards.

This page needs to be edited and expanded.

Last major edit by: Chris Hoffman

Applies to versions:

Tested: Yes/No

Questions and notes on this document:

  • This doc refers out to other documents (e.g., "See structure of the request in Imports Service Home").  Is that appropriate or should content be in this document?
  • Information towards the end of the document describing how to delete records should probably be moved to a separate document.

How to Import Data using the Imports Service

This document describes how to import data into CollectionSpace using the Imports Service.

A separate document (to be written) describes methods for extracting and transforming data from other systems in preparation for import (including using Talend Open Studio).

Prerequisites

This work requires:

  • An XML file containing records conforming to the XML schema for the corresponding CollectionSpace record type.
  • A user account in the CollectionSpace instance that has privileges to add data to the record in question.

Also helpful is:

  • Understanding of shell scripts

Layers involved

To carry out this task, you will need to make changes in the following CollectionSpace layer(s):

  • None.

Procedure

Step-by-step instructions (draft):

  • Confirm that the data file has the correct structure. (See "Structure of the request" in Imports Service Home).
  • Confirm that the data file is a well-formed XML document. Some text editors, XML editors, integrated development environments (IDEs), and other tools can help you do this.
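  • Confirm that the data file encodes these five special characters as XML entities wherever they appear in element content or attribute values (as opposed to the XML markup itself). (See, for instance, Add entities in XML for examples.)
    • ampersands (&), encoded as &amp;
    • apostrophes ('), encoded as &apos;
    • quotation marks ("), encoded as &quot;
    • 'less than' symbols (<), encoded as &lt;
    • 'greater than' symbols (>), encoded as &gt;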
  • Issue a curl command to send the data file to the CollectionSpace server:
    In CollectionSpace v2.2 and above:

    curl "http://localhost:8180/cspace-services/imports?type=xml" \
        -i \
        -u admin@core.collectionspace.org:Administrator \
        -F "file=@myimportfile.xml;type=application/xml"
    

    or in any recent version of CollectionSpace (including 2.2 and above):

    curl -X POST "http://localhost:8180/cspace-services/imports" \
         -i \
         -u admin@core.collectionspace.org:Administrator \
         -H "Content-Type: application/xml" \
         -T myimportfile.xml
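
The entity encoding described in the list above can be sketched in shell. This is a minimal sketch and not part of the Imports Service: the function name escape_xml is hypothetical, and it is meant to be applied to raw field values before they are placed inside XML elements, never to a finished XML file (which would also encode the markup itself).

```shell
#!/bin/bash

# escape_xml (hypothetical helper): reads raw text on stdin and writes it
# with the five special characters replaced by their XML entities.
# The ampersand must be replaced first, so that the ampersands introduced
# by the other replacements are not encoded a second time.
escape_xml() {
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' \
      -e "s/'/\&apos;/g"
}
```

For example, running printf '%s\n' "O'Brien & Sons" | escape_xml prints O&apos;Brien &amp; Sons.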
    

An example of a shell script that assembles the pieces to submit an import file, using the first method above:

#!/bin/bash

DATA=myimportfile.xml
URL="http://localhost:8180/cspace-services/imports?type=xml"
CONTENT_TYPE="application/xml"
USER="admin@YOUR_TENANT_NAME_HERE:YOUR_PASSWORD_HERE"
# Example: USER="admin@core.collectionspace.org:Administrator"

echo "Sending $DATA"
echo "to $URL"
echo "with content type $CONTENT_TYPE"
echo "as $USER"

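# Submit the file as a multipart form upload; save the response (with headers) to curl.out
curl "$URL" -i -u "$USER" -F "file=@$DATA;type=$CONTENT_TYPE" -o curl.out

# Mark the file as processed, then show the server's response
mv "$DATA" "${DATA}.done"
cat curl.out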
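
The script above renames the data file whether or not the upload succeeded. One possible variation, sketched here under assumptions, renames the file only when the server answers with an HTTP 2xx status. The function names is_success_status and submit_import are hypothetical; the curl options used (-s, -o, -w, -u, -F) are standard. Note that a 2xx status shows only that the request was accepted, so the checks described under Test still apply.

```shell
#!/bin/bash

# is_success_status (hypothetical helper): succeeds for HTTP 2xx codes.
is_success_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *)           return 1 ;;
  esac
}

# submit_import (hypothetical helper): uploads one import file and renames
# it to FILE.done only when the server answers with a 2xx status.
# Arguments: FILE URL USER:PASSWORD
submit_import() {
  local data=$1 url=$2 credentials=$3 status

  # -s silences the progress meter, -o saves the response body to curl.out,
  # and -w prints only the HTTP status code on stdout.
  status=$(curl -s -o curl.out -w '%{http_code}' \
    -u "$credentials" \
    -F "file=@$data;type=application/xml" \
    "$url")

  if is_success_status "$status"; then
    mv "$data" "${data}.done"
  else
    echo "Import of $data failed with HTTP status $status; see curl.out" >&2
    return 1
  fi
}
```

A call might look like: submit_import myimportfile.xml "http://localhost:8180/cspace-services/imports?type=xml" "admin@core.collectionspace.org:Administrator"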

Test

  • The console output will provide some information, especially if the job fails completely. However, the Imports Service generally reports success even when it has encountered problems. Additional logging is available in the cspace-services.log and catalina.out log files, in CSPACE_JEESERVER_HOME/logs. (See "Where to find errors" in Imports Service Home.)
  • Run data quality checks against the CollectionSpace database and system. If you loaded a batch of 1,000 records, check to see that the record count has incremented by the same number.
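
The two checks above can be partly scripted. The following is a minimal sketch, not a definitive procedure. It assumes that each record in the import file is wrapped in its own <import> element inside an enclosing <imports> element (confirm this against "Structure of the request" in Imports Service Home), and that failures show up in the logs as lines containing ERROR or Exception. Both function names are hypothetical.

```shell
#!/bin/bash

# count_import_records (hypothetical helper): counts the opening <import ...>
# tags in an import file, to compare against the change in record count
# after loading. The enclosing <imports> element is not counted.
count_import_records() {
  grep -o '<import[ >]' "$1" | wc -l | tr -d ' '
}

# scan_import_logs (hypothetical helper): prints, with file name and line
# number, the lines mentioning ERROR or Exception in the two log files.
# The search terms are an assumption about what failures look like.
# Argument: the logs directory, e.g. "$CSPACE_JEESERVER_HOME/logs"
scan_import_logs() {
  local log
  for log in "$1/cspace-services.log" "$1/catalina.out"; do
    [ -f "$log" ] && grep -H -n -E 'ERROR|Exception' "$log"
  done
  return 0
}
```

The count from count_import_records myimportfile.xml is the number by which the record count should have grown if every record loaded.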

The Imports Service does not appear to process records in the order in which they appear in the import file, although this has not been confirmed.

Problems have been encountered when loading large data sets. Experience has varied, but batches of about 10,000 records may be the practical maximum.
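
One way to stay under such a limit is to split a large import file into smaller batches before submitting them. The following is a rough sketch under stated assumptions, not a tested part of the Imports Service: it assumes that records are <import> elements inside an enclosing <imports> element, that each opening <import ...> tag and each closing </import> tag sits on its own line or on the same line as its own record only, and that the batch files can omit the XML declaration (that is, the data is UTF-8). The function name split_imports and the output file naming are hypothetical.

```shell
#!/bin/bash

# split_imports (hypothetical helper): writes PREFIX-001.xml, PREFIX-002.xml,
# and so on, each holding at most BATCH_SIZE records wrapped in <imports>.
# Arguments: FILE BATCH_SIZE PREFIX
split_imports() {
  local file=$1 size=$2 prefix=$3
  awk -v size="$size" -v prefix="$prefix" '
    # Start a new batch file and write its opening wrapper element.
    function open_batch() {
      batch++
      out = sprintf("%s-%03d.xml", prefix, batch)
      print "<imports>" > out
      n = 0
    }
    # Write the closing wrapper element and close the current batch file.
    function close_batch() {
      print "</imports>" > out
      close(out)
    }
    # An opening <import ...> tag starts a record (this pattern does not
    # match the enclosing <imports> element).
    /<import[ >]/ {
      if (out == "") open_batch()
      else if (n == size) { close_batch(); open_batch() }
      inrec = 1
    }
    # Copy every line that belongs to a record into the current batch.
    inrec { print > out }
    # A closing </import> tag ends the record.
    /<\/import>/ { n++; inrec = 0 }
    END { if (out != "") close_batch() }
  ' "$file"
}
```

For example, split_imports myimportfile.xml 10000 batch would write batch-001.xml, batch-002.xml, and so on, which can then be submitted one at a time.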

If it doesn't work

Data will not load because of a formatting problem

  • Correct the formatting problem and run the curl-based import again.

Some data were loaded but need to be deleted (THIS SHOULD BE BROKEN OUT AS A SEPARATE DOCUMENT)

  • Obtain a list of the CSIDs for the records that need to be deleted and issue a curl command to delete them.

An example of a shell script that sends delete commands to remove a set of items, specified by their CSIDs, from a specified Person Authority:

#!/bin/bash

SERVICE="cspace-services/personauthorities/2df0880c-2576-4127-b1a8/items"
CONTENT_TYPE="Content-Type: application/xml"
URL="http://localhost:8180"
USER="admin@YOUR_TENANT_NAME_HERE:YOUR_PASSWORD_HERE"
# Example: USER="admin@core.collectionspace.org:Administrator"

flag=1
count=0

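# Read one CSID per line from todelete.list
while IFS= read -r CSID
do
  # On the first pass only, echo the full curl command for reference
  if [ "$flag" -eq 1 ]
  then
    echo "curl -X DELETE $URL/$SERVICE/$CSID -u \"$USER\" -H \"$CONTENT_TYPE\""
  fi
  count=$((count+1))
  echo -n "${count}: $CSID"
  curl -X DELETE "$URL/$SERVICE/$CSID" -u "$USER" -H "$CONTENT_TYPE"
  # Log each CSID for which a delete request was sent
  echo "$CSID" >> deleted.list
  echo " : Sleeping"
  sleep 7
  flag=0
done < todelete.list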

Note that the sleep command forces the script to pause for 7 seconds to allow the system to clean up.

If you delete a record in the CollectionSpace UI (e.g., a person record), it actually remains in the database, though with a lifecyclestate (in the 'misc' table) of 'deleted'. That record has just been 'soft deleted' - marked as in a deleted state, but with its data still present. If you try to import a record using the same CSID, the Imports Service will report a successful import, but the record will not be added.

You need to send a delete command via the services' REST-based APIs - for example, by running a curl-based delete - to remove the record completely, and you can do so even for a record that does not appear in the UI.

See Also

Imports Service Home

Imports Service user instructions