How to Import Data
This document, while broadly accurate, needs to be updated to match current documentation standards.
This page needs to be edited and expanded.
Last major edit by: Chris Hoffman
Applies to versions:
Tested: Yes/No
Questions and notes on this document:
- This doc refers out to other documents (e.g., "See structure of the request in Imports Service Home"). Is that appropriate or should content be in this document?
- Information towards the end of the document describing how to delete records should probably be moved to a separate document.
How to Import Data using the Imports Service
This document describes how to import data into CollectionSpace using the Imports Service.
A separate document (to be written) describes methods for extracting and transforming data from other systems in preparation for import (including using Talend Open Studio).
Prerequisites
This work requires:
- An XML file containing records conforming to the XML schema for the corresponding CollectionSpace record type.
- A user account in the CollectionSpace instance that has privileges to add data to the record in question.
Also helpful is:
- Understanding of shell scripts
Layers involved
To carry out this task, you will need to make changes in the following CollectionSpace layer(s):
- None.
Procedure
Step by step instructions - just a draft here.
- Confirm that the data file has the correct structure. (See "Structure of the request" in Imports Service Home).
- Confirm that the data file is a well-formed XML document. Some text editors, XML editors, integrated development environments (IDEs), and other tools can help you do this.
- Confirm that the data file encodes these five special characters as XML entities wherever they appear in record data (element text and attribute values), as opposed to in the XML markup itself. (See, for instance, Add entities in XML for examples.)
- ampersands (&)
- apostrophes (')
- quotation marks (")
- 'less than' symbols (<)
- 'greater than' symbols (>)
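The checks above can be sketched in shell. This is a minimal sketch, not part of the Imports Service: xml_escape and is_well_formed are hypothetical helper names, and the second one assumes the xmllint tool (from libxml2) is installed.

```shell
#!/bin/bash
# Hypothetical helpers; neither is part of CollectionSpace.

xml_escape() {
  # Read text on stdin and replace the five special characters with
  # their predefined XML entities. The ampersand must be handled first,
  # otherwise the ampersands in the other entities would be re-escaped.
  sed -e 's/&/\&amp;/g' \
      -e 's/</\&lt;/g' \
      -e 's/>/\&gt;/g' \
      -e 's/"/\&quot;/g' \
      -e "s/'/\&apos;/g"
}

is_well_formed() {
  # $1 = path to an XML file.
  # Assumption: xmllint is available; returns 2 if it is not.
  if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint not found; skipping well-formedness check" >&2
    return 2
  fi
  # --noout suppresses the parsed document; the exit status is non-zero
  # when the file is not well-formed.
  xmllint --noout "$1"
}
```

Apply xml_escape to individual data values before they are placed into the import file, for example `printf '%s' "$VALUE" | xml_escape`. Running it over a whole XML file would also escape the markup itself.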
Issue a curl command to send the data file to the CollectionSpace server:
In CollectionSpace v2.2 and above:
curl "http://localhost:8180/cspace-services/imports?type=xml" -i -u admin@core.collectionspace.org:Administrator -F "file=@myimportfile.xml;type=application/xml"
or in any recent version of CollectionSpace (including 2.2 and above):
curl -X POST http://localhost:8180/cspace-services/imports -i -u admin@core.collectionspace.org:Administrator -H "Content-Type: application/xml" -T myimportfile.xml
An example of a shell script that assembles the pieces to submit an import file, using the first method above:
#!/bin/bash
DATA=myimportfile.xml
URL="http://localhost:8180/cspace-services/imports?type=xml"
CONTENT_TYPE="application/xml"
USER="admin@YOUR_TENANT_NAME_HERE:YOUR_PASSWORD_HERE"
# Example: USER="admin@core.collectionspace.org:Administrator"
echo "Sending $DATA"
echo "to $URL"
echo "with content type $CONTENT_TYPE"
echo "as $USER"
curl "$URL" -i -u "$USER" -F "file=@$DATA;type=$CONTENT_TYPE" -o curl.out
mv "$DATA" "${DATA}.done"
cat curl.out
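The script above renames the data file whether or not the request succeeded. One possible variation, sketched here under assumptions, renames the file only when the server answers with an HTTP 2xx status. The function names is_success and submit_import are hypothetical. A 2xx response does not prove that every record loaded, so the checks in the Test section still apply.

```shell
#!/bin/bash
# Hypothetical helpers; not part of CollectionSpace.

is_success() {
  # Exit 0 for any 2xx HTTP status code, 1 otherwise.
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

submit_import() {
  # $1 = data file, $2 = Imports Service URL, $3 = user:password
  local data="$1" url="$2" user="$3" status
  # -o writes the response body to curl.out; -w prints only the HTTP
  # status code to stdout, which is captured in $status.
  status=$(curl -sS -o curl.out -w '%{http_code}' -u "$user" \
    -F "file=@${data};type=application/xml" "$url") || return 1
  cat curl.out
  if is_success "$status"; then
    mv "$data" "${data}.done"
  else
    echo "Import of $data returned HTTP $status; file left in place" >&2
    return 1
  fi
}
```

A call might look like `submit_import myimportfile.xml "http://localhost:8180/cspace-services/imports?type=xml" "admin@core.collectionspace.org:Administrator"`.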
Test
- The console output will provide some information, especially if the job fails completely. However, the Imports Service generally reports success even when it has encountered problems. Additional logging is available in the cspace-services.log and catalina.out log files, in CSPACE_JEESERVER_HOME/logs. (See "Where to find errors" in Imports Service Home.)
- Run data quality checks against the CollectionSpace database and system. If you loaded a batch of 1,000 records, check to see that the record count has incremented by the same number.
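Both checks can be partly scripted. The sketch below rests on assumptions that should be verified against your own files: that each record in the import file is an `<import ...>` element, and that problem lines in the logs contain the text ERROR or Exception. All three function names are hypothetical.

```shell
#!/bin/bash
# Hypothetical helpers; not part of CollectionSpace.

count_import_records() {
  # $1 = import file. Counts opening <import ...> tags.
  # Assumption: every record is wrapped in an <import> element.
  # The [ >] after "import" keeps the <imports> root from being counted.
  grep -o '<import[ >]' "$1" | wc -l | tr -d ' '
}

count_log_errors() {
  # $1 = log file. Counts lines mentioning ERROR or Exception.
  # Assumption: these markers identify problems in the log.
  grep -c -E 'ERROR|Exception' "$1"
}

check_increment() {
  # $1 = record count before the import, $2 = count after,
  # $3 = number of records submitted. Exit 0 when they agree.
  [ $(( $2 - $1 )) -eq "$3" ]
}
```

For example, `count_import_records myimportfile.xml.done` gives the number of records submitted, and `count_log_errors "$CSPACE_JEESERVER_HOME/logs/cspace-services.log"` gives a rough signal of trouble, assuming CSPACE_JEESERVER_HOME is set in your environment. The before and after counts passed to check_increment have to come from your own query against the database or the services.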
The Imports Service does not appear to process records in the order in which they appear in the import file. This has not been confirmed.
Problems have been encountered loading large data sets. While experience has varied, batches of 10,000 records may be the practical maximum size.
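One way to stay under such a limit is to split a large import file into smaller batches before sending them. The following is a rough sketch, not a tested production tool. It assumes that the root element is `<imports>`, that each record is an `<import ...>` element, and that every opening and closing `<import>` tag sits on its own line. The function name split_imports and the output file naming are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper; not part of CollectionSpace.

split_imports() {
  # $1 = input file, $2 = records per batch (default 1000),
  # $3 = output prefix (default "batch-").
  # Writes PREFIX001.xml, PREFIX002.xml, ... each wrapped in <imports>.
  awk -v size="${2:-1000}" -v prefix="${3:-batch-}" '
    function open_batch() {
      batch++
      out = sprintf("%s%03d.xml", prefix, batch)
      print "<imports>" > out
    }
    function close_batch() {
      print "</imports>" > out
      close(out)
    }
    # Start of a record: begin a new batch file every "size" records.
    /<import[ >]/ {
      if (n % size == 0) {
        if (n > 0) close_batch()
        open_batch()
      }
      n++
      inrec = 1
    }
    # Copy every line that belongs to a record.
    inrec { print > out }
    # End of a record.
    /<\/import>/ { inrec = 0 }
    END { if (n > 0) close_batch() }
  ' "$1"
}
```

For example, `split_imports myimportfile.xml 5000 part-` would write part-001.xml, part-002.xml, and so on, each of which can then be submitted separately. Lines outside any record, such as an XML declaration, are not copied to the batch files.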
If it doesn't work
Data will not load because of a formatting problem
- Correct the formatting problem and run the curl-based import again
Some data were loaded but need to be deleted (THIS SHOULD BE BROKEN OUT AS A SEPARATE DOCUMENT)
- Obtain a list of the CSIDs for the records that need to be deleted and issue a curl command to delete them.
An example of a shell script that sends delete commands to remove a set of items, specified by their CSIDs, from a specified Person Authority:
#!/bin/bash
SERVICE="cspace-services/personauthorities/2df0880c-2576-4127-b1a8/items"
CONTENT_TYPE="Content-Type: application/xml"
URL="http://localhost:8180"
USER="admin@YOUR_TENANT_NAME_HERE:YOUR_PASSWORD_HERE"
# Example: USER="admin@core.collectionspace.org:Administrator"
flag=1
count=0
while read CSID
do
  if [ $flag -eq 1 ]
  then
    echo "curl -X DELETE $URL/$SERVICE/$CSID -u \"$USER\" -H \"$CONTENT_TYPE\""
  fi
  count=$((count+1))
  echo -n "${count}: $CSID"
  curl -X DELETE "$URL/$SERVICE/$CSID" -u "$USER" -H "$CONTENT_TYPE"
  echo "$CSID" >> deleted.list
  echo " : Sleeping"
  sleep 7
  flag=0
done < todelete.list
Note that the sleep command forces the script to pause for 7 seconds to allow the system to clean up.
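Because the loop reads todelete.list line by line, a blank line would produce a DELETE request with an empty CSID, aimed at the items URL itself. It may be worth cleaning the list first. The sketch below is one possible approach; the function name clean_csid_list is hypothetical, and treating lines that start with # as comments is an assumption of this sketch, not a CollectionSpace convention.

```shell
#!/bin/bash
# Hypothetical helper; not part of CollectionSpace.

clean_csid_list() {
  # $1 = file with one CSID per line.
  # Strips carriage returns and surrounding whitespace, drops blank
  # lines and lines starting with # (assumed to be comments), and
  # removes duplicates while keeping the original order.
  tr -d '\r' < "$1" \
    | sed -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' \
    | grep -v -e '^$' -e '^#' \
    | awk '!seen[$0]++'
}
```

For example, `clean_csid_list raw.list > todelete.list` would prepare the input for the delete script, where raw.list is a placeholder name for your uncleaned list.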
If you delete a record in the CollectionSpace UI (e.g., a person record), it actually remains in the database, though with a lifecyclestate (in the 'misc' table) of 'deleted'. That record has just been 'soft deleted' - marked as in a deleted state, but with its data still present. If you try to import a record using the same CSID, the Imports Service will report a successful import, but the record will not be added.
You need to send a delete command via the services' REST-based APIs - for example, by running a curl-based delete - to remove the record completely, and you can do so even for a record that does not appear in the UI.