User Manual: CSV Importer: Media Handling: Ingesting Files

You can import media files associated with Media Handling procedures.

Formatting CSV data for media file ingest

The last column in the Media Handling CSV template has the header: mediaFileURI.

This column records the URI of the media file you wish to ingest on the Media Handling record represented by the row. See “Requirements for successful media ingest” below for details on what the URI must look like.

If the mediaFileURI column is populated, the media file located at the URI will be ingested if:

  • the Media Handling record is being newly created in CollectionSpace; OR

  • the Media Handling record currently exists in CollectionSpace but does not already have a media file associated with it.

If a Media Handling record exists in CollectionSpace and already has a media file associated with it, data from the other columns will be updated in that record, but the Importer will ignore the mediaFileURI value in that row.
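The rules above can be summarized as a small decision function. This is only an illustrative sketch of the behavior described in this section, not the Importer's actual code; the function and parameter names are hypothetical.

```python
def will_ingest_media(media_file_uri, record_exists, record_has_file):
    """Return True if the file at media_file_uri would be ingested.

    Hypothetical helper mirroring the rules described above:
    - an empty mediaFileURI never triggers an ingest
    - a new record always gets its media file
    - an existing record gets one only if it has no media file yet
    """
    if not media_file_uri.strip():
        return False
    if not record_exists:
        return True
    return not record_has_file
```

For example, `will_ingest_media("https://example.org/a.jpg", record_exists=True, record_has_file=True)` is False: the other columns in that row are still applied, but the URI is ignored.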

The CSV Importer does not verify or validate mediaFileURI values. It does not access the files. It just passes along the URI to the CollectionSpace application with the rest of your Media Handling record information.

Upon receiving this data, the CollectionSpace application itself creates or updates the Media Handling record and attempts to ingest the media file, just as it would if you manually entered the mediaFileURI value as the File Information > Upload (external media) > URL value and saved the record.

Requirements for successful media ingest

mediaFileURI must begin with http:// or https://

The CollectionSpace application knows how to request remote media files over HTTP and HTTPS.
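Because the Importer does not validate mediaFileURI values, you may want to check them yourself before importing. The following is a minimal sketch using only the Python standard library. The function names and the identificationNumber column in the usage example are assumptions for illustration; only the mediaFileURI header comes from the template described above.

```python
import csv
import io
from urllib.parse import urlparse

ALLOWED_SCHEMES = {"http", "https"}


def has_supported_scheme(uri):
    """True if uri starts with http:// or https:// and names a host."""
    parsed = urlparse(uri.strip())
    return parsed.scheme in ALLOWED_SCHEMES and bool(parsed.netloc)


def rows_with_unsupported_uris(csv_text):
    """Return the line numbers of rows whose mediaFileURI is not http(s).

    Rows with an empty mediaFileURI are skipped. Line numbers assume
    one CSV record per line, with the header on line 1.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    bad_lines = []
    for line_number, row in enumerate(reader, start=2):
        uri = (row.get("mediaFileURI") or "").strip()
        if uri and not has_supported_scheme(uri):
            bad_lines.append(line_number)
    return bad_lines
```

A possible usage, with made-up data:

```python
sample = (
    "identificationNumber,mediaFileURI\n"
    "MR1,https://example.org/a.jpg\n"
    "MR2,ftp://example.org/b.jpg\n"
)
print(rows_with_unsupported_uris(sample))
```

Locally managed instances that accept file:// URIs would need "file" handled separately, since such URIs have no host part.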

Non-hosted, locally managed CollectionSpace instances only

You may be able to provide file:// protocol URIs pointing to files that exist on the same machine as your CollectionSpace instance (or are reachable from it). Consult the person who administers your CollectionSpace and CSV Importer for details. For more information, see the “URL Protocols” section of: Blob Service RESTful API

The location/service in which files are hosted for ingest must not throttle requests for files

Doing a batch ingest with media files causes many file requests to be sent very quickly. This is easily detected as non-human activity because of the scale and speed with which it occurs.

If a server or service is configured to detect automated file requests and respond by throttling them (e.g. responding increasingly slowly as more requests are made) or blocking them, media ingest will fail. If access is blocked, CollectionSpace will not be able to download the file for ingest. If access is throttled past a certain point, CollectionSpace will wait too long for a response and treat the file(s) as inaccessible.

Good places to put files for ingest

  • In publicly accessible directories on a Web server

  • AWS (Amazon Web Services) S3 bucket with access policies set to public

Potentially workable places to put files for ingest

  • Dropbox

    • We were informed several years ago that a CollectionSpace user was able to ingest media files from Dropbox. We have no details on whether the Dropbox terms of use/service may have changed, or what kind of account this user had.

Places that do not work for ingest

  • Google Drive

    • Known to begin blocking/throttling file requests after 15-30 files have been retrieved.

Report your own experiences ingesting to help other users

Have you recently used Dropbox and found that it still works? Did you try to ingest from SharePoint with no success? Let us know and we can expand or update this list.

File at mediaFileURI must be accessible to the CollectionSpace application

As described above, the CSV Importer merely passes your mediaFileURI values to the CollectionSpace application instance it is connected to.

If, when you try to visit the mediaFileURI in your browser, you are asked to:

  • log in

  • deal with login verification or authentication codes

  • prove you are not a robot

then the CollectionSpace application will not be able to access the files. In this situation, the CollectionSpace application IS a robot, and it has no mechanism for storing your login information on third-party sites or for navigating those sites' authentication processes.

Note that, in the next section, example #2 cannot be accessed by CollectionSpace, even when we use the direct URL, presumably because the server where the file lives is configured to reject underlying system-to-system requests. The server where the files are hosted for ingest must allow this kind of request.
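A rough way to approximate what the application sees is to request the URI from a script that sends no cookies or credentials. The sketch below uses Python's standard library. The function names and the User-Agent string are hypothetical, and a script run from your own computer is not guaranteed to be treated the same way as a request from your CollectionSpace server.

```python
import urllib.error
import urllib.request


def anonymous_status(uri, timeout=10.0):
    """Return the HTTP status code an anonymous client receives for uri.

    No cookies or login details are sent. Network failures other than
    HTTP error responses (DNS errors, timeouts) are raised to the caller.
    """
    request = urllib.request.Request(
        uri, headers={"User-Agent": "ingest-preflight/0.1"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 400 or 403.
        return err.code


def is_reachable(status):
    """Treat only 2xx responses as a successful anonymous fetch."""
    return 200 <= status < 300
```

A non-2xx status is a strong hint that ingest will fail. A 2xx status is not proof of success, since some sites answer anonymous requests with a login page instead of the file.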

mediaFileURI points directly to the media file you wish to ingest

The following URIs may look as though they point to image files:

  1. https://commons.wikimedia.org/wiki/File:Guinea_fowl_keet_2_(cropped).jpg

  2. https://www.loc.gov/item/2018757362/

  3. https://www.digitalcommonwealth.org/search/commonwealth:zk520p665

However, in each case, the link goes to a WEB PAGE presenting a view of the image, with metadata and other contextual information about the image, wrapped in the interface of the overall website/digital collection.

Each of these URIs behaves differently if you ingest it as a media file in CollectionSpace, but none of them ingest the image as you might expect:

  1. No error. CollectionSpace thinks it’s a .jpg since the URI ends with that, so all looks good. But if you try to access the “.jpg” from the Media Handling record, you get an error, since the HTML file that was actually ingested can’t be displayed as a .jpg.

  2. Error. CollectionSpace rejects this URI.

  3. No error. CollectionSpace ingests a “.dms” file with a mime-type of “application/octet-stream”. Accessing the file from the Media Handling record requires you to download the .dms file to your computer. If you open that file locally with a text editor, you will see that it is the HTML source code of the image web page.

The URIs to actually ingest these images would be:

  1. https://upload.wikimedia.org/wikipedia/commons/9/9e/Guinea_fowl_keet_2_(cropped).jpg

  2. https://tile.loc.gov/storage-services/service/pnp/ppmsca/59600/59644v.jpg - This would work, except that it seems this site blocks programmatic access to its images. You can view and download the .jpg manually in your browser, but when the CollectionSpace application requests the file at the URI, it receives a 400 error.

  3. https://differentdomain.net/derivatives/images/imageid/image_access_800.jpg - Fake URI showing the pattern. This source does not want to make its images available for direct programmatic access.
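One way to tell a web page from a direct file link before importing is to look at the Content-Type header the server returns. The sketch below is an assumption-laden illustration using Python's standard library, not part of the CSV Importer; the function names and User-Agent string are hypothetical.

```python
import urllib.request

# Content types that indicate an HTML page rather than a media file.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_web_page(content_type_header):
    """True if a Content-Type header value describes an HTML page.

    Parameters such as "; charset=UTF-8" are ignored.
    """
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


def fetch_content_type(uri, timeout=10.0):
    """Request uri and return its Content-Type header ("" if absent)."""
    request = urllib.request.Request(
        uri, headers={"User-Agent": "ingest-preflight/0.1"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type", "")
```

If `is_web_page(fetch_content_type(uri))` is True, the URI most likely leads to a page about the file rather than to the file itself, regardless of how the URI ends.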

Questions and Answers

Can I ingest more than one file on a given Media Handling record?

No. The Media Handling procedure is intended to describe a single media file, so only one file should be attached to a given Media Handling procedure.

Ingesting a second media file on an existing record via the CollectionSpace Services API will cause the link to the first media file to be removed. The CSV Importer does not permit you to do this, since doing it accidentally could be very destructive to your CollectionSpace data.