CSV Import Tool Workflow: CSV Import, Validation, and Transfer/Update of New/Existing Records

User stories

  • User can upload a CSV file to the CSV Import Tool

  • User can receive validation message from CSV Import Tool (valid/invalid)

  • User can receive data check message from CSV Import Tool (new/update/warnings)

  • User can export CSV with warnings (row/field/type) noted

  • User can transfer all, new only, or updates only to CollectionSpace

Summary

User can create or update CollectionSpace records via the CSV Import Tool

Rationale

Users need a way to bulk create and update records in CollectionSpace.

Users

CollectionSpace power users with permission to use the Tool.

Basic Course of Events

Each numbered step below is described in up to three parts: the user interface/GUI application behavior, the technical work under the hood, and comments/questions.

Step 1

User logs in to the CSV Import Tool and creates a new Import/Update Batch

 

Delete Batch is a separate workflow.

v2 option: Implement projects to group batches together. If implemented, there would be intermediate steps: create a new Project, or select an existing Project, in which to create the Batch.

Step 2

Batch creation screen: User selects which connection to use

 

 

Step 3

Batch creation screen: User selects record type for the batch from a dropdown

 

Constrained by profile/version set in connection or group default.

Step 4

Batch creation screen: OPTIONAL: User may override default import configuration and specify batch-specific import configuration options.

v1 - User can paste in a JSON config

v2 - GUI form for specifying/displaying default configs that will create JSON under the hood

 

v1 - global config that can be overridden at the batch level by pasting in JSON.

Default import config options include:

  • repeated value delimiter

  • repeated subgroup value delimiter

  • date element order (for dates like 05/03/2020 – is this May 3 or March 5?)

  • default field values override any data in spreadsheet (y/n)

Batch-specific import config options include:

  • changes/overrides to the default import config options

  • field value transformations including:

    • downcase values

    • find/replace (literal or regex supported)

  • default field values
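
In v1 this config is pasted as JSON. A minimal sketch of what a batch-level config overriding some defaults might look like (all key names here are invented for illustration, not the tool’s actual schema):

    {
      "delimiter": "|",
      "subgroup_delimiter": "^^",
      "date_element_order": "month day year",
      "default_values_override_csv": false,
      "transforms": {
        "collection": ["downcase"]
      },
      "default_values": {
        "publishTo": "None"
      }
    }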

v2 or beyond: The same default field values may be helpful across multiple batches. For example, the same defaults may be used for any load of Nomenclature concepts or ULAN names. Consider a way to save/reuse configs in the UI.

Step 5

Batch creation screen: User browses to and selects the CSV file they want to upload

 

 

Step 6

Batch creation screen: User clicks UPLOAD button

Importer Application:

  • Ingests and parses the CSV into individual rows (Records). If parsing succeeds, goes on to:

  • Stores Records in the database

  • Instantiates a Mapper::DataHandler, passing the import config JSON
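
In rough Ruby terms, the ingest step might look like the sketch below. The Record model and the Mapper::DataHandler constructor arguments are assumptions for illustration:

    require 'csv'
    require 'json'

    # Parse the uploaded file. CSV.parse raises CSV::MalformedCSVError on bad
    # quoting, and invalid UTF-8 raises an encoding error during parsing.
    rows = CSV.parse(File.read(csv_path), headers: true)

    # Store each row as a Record belonging to this batch.
    rows.each { |row| Record.create!(batch: batch, data: row.to_h) }

    # Hand the batch's import config JSON to the mapper.
    handler = Mapper::DataHandler.new(config: JSON.parse(batch.import_config))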

 

Step 7

Unless CSV parsing fails completely, user is shown indication of progress

 

How should overall progress be determined when multiple different processes are being run, especially if they run concurrently?

Step 8

Post-import screen: If Importer Application was unable to parse the CSV file at all:

  • User is shown an informative error(1)

User cannot move forward in workflow from here.

If Importer Application was able to parse the CSV file:

  • User is shown the number of Records created from the CSV. This should correspond to the expected number of non-header rows in the CSV.(2)

  • User is shown list of any columns from CSV that will not be mapped

  • User is shown a VALIDATE button

 

(1) We typically run into this situation when a CSV contains invalid UTF-8. The average user cannot find one bad UTF-8 character in a CSV, so an informative error would say what row and field the character appears in.

Another issue that can arise here is that quote escaping can get messed up such that the CSV is unparseable.
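
A sketch of how such an informative error could be produced, scanning the raw file for the location of invalid UTF-8 (naive comma splitting is used deliberately, since the file may not be cleanly parseable as CSV at this point):

    # Read raw bytes so the scan itself never trips over bad encoding.
    File.foreach(csv_path, encoding: 'BINARY').with_index(1) do |line, row_num|
      line.split(',').each_with_index do |field, i|
        unless field.force_encoding('UTF-8').valid_encoding?
          warn "Invalid UTF-8 in row #{row_num}, field #{i + 1}"
        end
      end
    end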

(2) Bad quote escaping and/or the inclusion of various control (end of line) and whitespace characters within field values can cause a CSV to be parsed successfully but incorrectly. You may end up with fewer rows than expected or more rows than expected.

Step 9

User clicks VALIDATE button

Importer Application:

  • Calls Mapper::DataHandler.validate(record) on first row

Mapper::DataHandler:

  • Performs validation of given Record

  • Returns a Mapper::Response

Importer Application:

  • Checks the Mapper::Response for inclusion of an error with type “Required field missing”

If first row is valid

  • Calls Mapper::DataHandler.validate(record) on all rows

  • Checks the Mapper::Responses for inclusion of any errors with type “Required field empty”
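
Sketched in Ruby, with the error structure on Mapper::Response and the halt/continue helpers assumed for illustration:

    # A "Required field missing" error on the first row means a required
    # column is absent from the CSV as a whole.
    first = handler.validate(records.first)

    if first.errors.any? { |e| e[:type] == 'Required field missing' }
      halt_missing_column(first)  # hypothetical: show error, block workflow
    else
      # Columns are present; now check every row for empty required values.
      responses = records.map { |rec| handler.validate(rec) }
      invalid = responses.select do |r|
        r.errors.any? { |e| e[:type] == 'Required field empty' }
      end
      invalid.empty? ? show_process_button : halt_empty_fields(invalid)
    end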

 

Step 10

Post-validation screen: If Importer Application found a “Required field missing” error:

  • User is shown an error message stating their CSV is missing a column. They need to add that column, make sure every row in the column is populated, and start over. Name of missing column is given.

User cannot move forward in workflow from here.

If Importer Application found any “Required field empty” errors:

  • User is shown an error message stating that “x records are missing values in required field(s)”. Name(s) of the affected field(s) are given.

User cannot move forward in workflow from here.

If Importer Application found no errors:

  • User sees a PROCESS button

 

 

Step 11

User clicks PROCESS button

Importer Application:

  • Calls Mapper::DataHandler.process(record) on each Record

Mapper::DataHandler:

  • Merges in any default values from JSON config

  • Splits field values into arrays

  • Transforms split field values as specified by RecordMapper(1)

  • Quality checks transformed data, and compiles warnings(2)

  • Combines relevant values from multiple CSV fields into one CollectionSpace field(3)

  • Creates CollectionSpace XML record from combined data

  • Returns a Mapper::Response object that bundles:

    • hashes of the record at each stage of the process (original, merged, split, transformed, combined)(4)

    • identifier(5)

    • warnings of any data quality check failures

    • CollectionSpace XML document(6)

Importer Application:

  • As Mapper::Responses are returned, collates Response.identifier values to identify duplicates(7)

  • Looks up record in CollectionSpace instance to categorize it as New or Existing(8)

(1) Transforms currently include:

  • CSV input → refname for authority- and vocabulary-populated fields

  • CSV input (t, f, y, n, yes, no, true, false) → expected boolean field values

  • CSV input (e.g. 1) → full Behrensmeyer vocabulary value (i.e. the refname label)

(2) Currently, quality checks result in warnings if:

  • Authority or vocabulary term used does not exist in specified authority/vocabulary

  • Static list/option list term used does not match a value in the specified list

On the roadmap:

  • value in field does not match field data type (for boolean, float, integer fields)

  • value in structured date field not parsed into scalar values (the string from the CSV will still go in dateDisplayDate, but the user might want to review it)

  • value in non-structured date field not parseable as date

  • uneven number of values in fields within a field group or subgroup after data combination - may be desired/expected, or may indicate errors in how complex data was entered in CSV

  • more than one value in a non-repeatable field after data combination

(3) This handles, for example, putting ownerPerson and ownerOrganization values (which need to be imported and transformed to refnames separately), into one field for transformation to CollectionSpace XML.

(4) We may never surface this in the UI, but it will be invaluable for debugging unexpected Mapper results. KS: I may disable this by default but add a debug option to the process method.

(5) I.e. objectNumber value for a CollectionObject, or shortIdentifier value for a Person

(6) We are able to map even data with lots of warnings to a valid CollectionSpace XML document, illustrating why these warnings should not block the workflow moving forward.

We discussed making XML creation a separate step, since the user may abandon the workflow and start over at this point. But since it is so fast compared to the transformation and identifier retrieval/generation steps, we decided to avoid another separate call to the Mapper. (The Mapper still supports doing this with two separate method calls, if we decide we want to.)

(7) Note that the purpose of this is not to flag duplicate values from the CSV data that got turned into the CollectionSpace identifier (though it will do that). That is easily handled and should be done in the process of preparing the CSV.

The purpose here is to catch non-duplicate CSV values that become duplicate identifiers only after normalization and transformation. For example, “Bogotá, Colombia”, “Bogota, Colombia”, and “Bogota Colombia” become the same shortIdentifier value once normalized for CollectionSpace.
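
A sketch of the kind of normalization that collapses these values into one identifier (CollectionSpace’s actual rules may differ in detail):

    # "Bogotá, Colombia", "Bogota, Colombia", and "Bogota Colombia" all
    # normalize to "BogotaColombia".
    def short_identifier(value)
      value.unicode_normalize(:nfkd)  # decompose accented characters
           .gsub(/\p{Mn}/, '')        # strip the combining marks
           .gsub(/[^A-Za-z0-9]/, '')  # drop punctuation and whitespace
    end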

(8) We discussed whether there should be an option to turn this off to speed up the process if the User is certain all the records are new or existing.
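
The collation and categorization steps, sketched in Ruby (lookup_in_collectionspace stands in for whatever CollectionSpace API query is actually used):

    # Process every Record and collect the Mapper::Responses.
    responses = records.map { |rec| handler.process(rec) }

    # Collate identifiers to flag duplicates created by normalization (note 7).
    counts = responses.group_by(&:identifier).transform_values(&:size)
    duplicate_ids = counts.select { |_id, n| n > 1 }.keys

    # Categorize each record as New or Existing (note 8: possibly skippable).
    statuses = responses.map do |resp|
      lookup_in_collectionspace(resp.identifier) ? :existing : :new
    end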

Step 12

User is shown progress indication while Records are processed

 

 

Step 13

Post-processing screen: User is returned a report that clearly indicates:

  • Total number of records

  • Total number of new vs. existing records

  • Number of records having warnings

    • List of warnings received with count of records receiving each warning

This screen also displays a clear message if any duplicate identifiers were found in the set.

 

 

Step 14

Post-processing screen:
User can click to download an augmented CSV.

This CSV contains all rows that were imported. It always adds two columns:

  • CSRecordStatus (new or existing)(1)

  • Identifier(2)

If there are warnings, one column is added per warning type. The column header is the warning type value. Each row representing a record with that warning type has the warning message value in that column.(3)
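
A sketch of writing the augmented CSV; the attributes read from each response (original_row, record_status, identifier, warnings) are assumptions for illustration:

    require 'csv'

    # One extra column per distinct warning type found in the batch.
    warning_types = responses.flat_map { |r| r.warnings.map { |w| w[:type] } }.uniq

    CSV.open(out_path, 'w') do |csv|
      csv << original_headers + ['CSRecordStatus', 'Identifier'] + warning_types
      responses.each do |resp|
        warning_cells = warning_types.map do |type|
          resp.warnings.find { |w| w[:type] == type }&.dig(:message)
        end
        csv << resp.original_row + [resp.record_status, resp.identifier] + warning_cells
      end
    end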

 

Rick: Any need to specify file naming/versioning conventions for these?

(1) User may want to remove records flagged as new or existing before proceeding, whether there are warnings or not. Given that we won’t be displaying the list of records in the UI, this is also how a user would find, say, which 3 records in a large set of expected “new records” are unexpectedly flagged as existing.

(2) If the duplicate identifier warning was displayed, User can find the sets of records ending up with duplicate identifiers via conditional formatting or pivoting on this column

(3) This allows easy/clean filtering by warning, which simplifies the process of tasks like:

  • extracting new CSV of authority or vocabulary terms to load in a separate process

  • examining/remediating a particular type of warning

  • removing columns/messages for warnings you’ve already remediated or have determined are not problematic in your data

Consider adding ability to export CSV for only certain sets of the records for v2:

  • only those with or without warnings

  • only those with a selected warning

  • only new or existing

  • only those implicated in duplicate identifiers warning

Step 15

Post-processing screen:
User is shown two checkboxes:

  • Add new

  • Update existing

User can check one or both of the boxes.

If at least one box is checked, the DO THE THING button is activated.

 

I (KS) am suggesting here that User can add/update records with warnings!

We have discussed whether allowing updates with warnings could be turned on/off at the group level, or possibly managed at the user level within a group (i.e. maybe admins and managers can update records with warnings, but members cannot?)

Warnings were intended to be an opportunity to verify that your data is as expected, but not block you from adding/updating records.

My use cases:

  • If I’m getting warnings re: unknown option list values, maybe I need to add those values to my config. But maybe I don’t want to wait to have the records in the system. The value will be stored invisibly, and when the new values are added to my config, those fields will display in my previously added/updated records.

  • I want to get a warning about non-blank structured date fields that don’t get scalar values generated, so I can review those. If they are place names that accidentally ended up in a date field, I need to fix that. If they are some kind of odd date string I want to be able to ignore this warning and load the records.

  • The warning about uneven numbers of values across a field group/subgroup might be indicative of a data problem OR it might be fine. I want to be able to check, but not be required to always have even numbers of values, because that’s unrealistic.

Step 16

User clicks DO THE THING button.

If there are warnings (either duplicate identifier or on individual rows), some manner of “ARE YOU SURE?” confirmation happens and the user has to verify they want to proceed.

Importer Application:

  • filters to only new or only existing records if only one of the boxes was checked

  • transfers records in set to CollectionSpace using appropriate method for each record
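
In sketch form; the client methods are hypothetical wrappers around the CollectionSpace REST API (new records are created with POST, existing ones updated with PUT):

    # Honor the user's checkbox selection.
    to_transfer = responses.select do |resp|
      (add_new && resp.record_status == :new) ||
        (update_existing && resp.record_status == :existing)
    end

    to_transfer.each do |resp|
      case resp.record_status
      when :new      then client.create(resp.xml)                    # POST
      when :existing then client.update(resp.identifier, resp.xml)   # PUT
      end
    end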

 

Step 17

User is shown a progress indicator

 

 

Step 18

Post-transfer screen: User is shown:

  • number of failed new record additions

  • number of failed record updates

User can click on a link to download CSV of just the failures.

CSV includes all original data columns, plus a FailureMessage column.(1)

User also can click on a link to go back to the Post-processing screen.(2)

 

(1) User will need to examine and possibly re-process this data and only this data

(2) User may prefer to initiate add new vs. update existing processes separately. Or they may have decided they wanted to get the CSV after all.

Step 19

And then….

 

Should workflow for managing/deleting batches be separate?

Post-conditions

  1. Users are able to view new or updated records in CollectionSpace

  2. Users are able to confirm that they have removed records from CollectionSpace