User Manual: CSV Importer: Media Handling: Ingesting Files
You can import media files associated with Media Handling procedures.
- 1 Formatting CSV data for media file ingest
- 2 Requirements for successful media ingest
- 3 Questions and Answers
Formatting CSV data for media file ingest
The last column in the Media Handling CSV template has the header: mediaFileURI.
This column is used to record the URI for a media file you wish to ingest on the Media Handling record represented by the row. See Requirements for successful media ingest for more info on the URI requirements.
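As an illustration of the layout described above, the following minimal sketch builds a small CSV with mediaFileURI as its last column. Only the mediaFileURI header comes from the template; the other column names (identificationNumber, title), the record values, and the example.org URI are assumptions made for illustration, so use the headers from your own template.

```python
import csv
import io

# Hypothetical rows: only the mediaFileURI header is taken from the template
# described above. The other column names and all values are assumptions.
rows = [
    {"identificationNumber": "MR2024.1", "title": "Front view",
     "mediaFileURI": "https://example.org/media/front.jpg"},
    {"identificationNumber": "MR2024.2", "title": "No file yet",
     "mediaFileURI": ""},  # blank: no media file is requested for this row
]


def build_csv(rows):
    """Return CSV text with mediaFileURI as the last column."""
    fieldnames = ["identificationNumber", "title", "mediaFileURI"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = build_csv(rows)
print(csv_text)
```

The second row leaves mediaFileURI blank, which is how a row without a media file to ingest would look under these assumptions.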
If the mediaFileURI column is populated, the media file located at the URI will be ingested if:
the Media Handling record is being newly created in CollectionSpace; OR
the Media Handling record currently exists in CollectionSpace but does not already have a media file associated with it.
If a Media Handling record exists in CollectionSpace but already has a media file associated with it, data from the other columns will still be updated in that record, but the Importer will ignore the mediaFileURI value in that row.
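The rules above can be summarized as a small decision function. This is only a sketch of the documented behavior, not the Importer's actual code; the function name, its parameters, and the returned strings are all hypothetical.

```python
def media_ingest_action(uri, record_exists, has_media_file):
    """Describe what happens to a row's mediaFileURI value.

    Hypothetical helper mirroring the rules in this section:
    uri            -- the row's mediaFileURI value (may be empty or None)
    record_exists  -- True if the record already exists in CollectionSpace
    has_media_file -- True if that record already has a media file attached
    """
    uri = (uri or "").strip()
    if not uri:
        return "no media file requested"
    if not record_exists:
        return "ingest on new record"
    if not has_media_file:
        return "ingest on existing record"
    # Existing record that already has a file: other columns are still updated.
    return "URI ignored; other columns still updated"


print(media_ingest_action("https://example.org/a.jpg", True, True))
```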
The CSV Importer does not verify or validate mediaFileURI values. It does not access the files. It just passes the URI along to the CollectionSpace application with the rest of your Media Handling record information.
Upon receiving this data, the CollectionSpace application itself creates or updates the Media Handling record and attempts to ingest the media file in the same way it would if you manually entered the mediaFileURI value as the File Information > Upload (external media) > URL value and saved the record.
Requirements for successful media ingest
mediaFileURI must begin with http:// or https://
The CollectionSpace application knows how to request remote media files via HTTP protocols.
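Because the CSV Importer does not validate these values, it can help to check the prefix yourself before importing. The sketch below is one way to do that; the function name and the allow_file option are assumptions, and allow_file is only relevant to locally managed instances where file:// URIs have been confirmed to work.

```python
def has_supported_prefix(uri, allow_file=False):
    """Return True if the URI begins with http:// or https://.

    Hypothetical pre-import check. Set allow_file=True only if your
    administrator has confirmed that file:// URIs work on your instance.
    """
    prefixes = ["http://", "https://"]
    if allow_file:
        prefixes.append("file://")
    return uri.strip().lower().startswith(tuple(prefixes))


for candidate in ["https://example.org/a.jpg", "ftp://example.org/a.jpg",
                  "media/a.jpg"]:
    print(candidate, has_supported_prefix(candidate))
```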
Non-hosted, locally managed CollectionSpace instances only
You may be able to provide file:// protocol URIs pointing to files that exist on the same machine as your CollectionSpace instance (or are reachable from it). Consult the person who administers your CollectionSpace and CSV Importer for details. More information is available in the “URL Protocols” section of: Blob Service RESTful API
The location/service in which files are hosted for ingest must not throttle requests for files
Doing a batch ingest with media files causes many file requests to be sent very quickly. This is easily detected as non-human activity because of the scale and speed with which it occurs.
If a server or service is configured to detect automated file requests and respond by throttling (e.g. responding increasingly slowly as more requests are made) or blocking them, media ingest will fail. If access is blocked, CollectionSpace will not be able to download the file for ingest. If access is throttled past a certain point, CollectionSpace will begin to assume it cannot access the file(s) because it is having to wait too long for a response.
Good places to put files for ingest
In publicly accessible directories on a Web server
AWS (Amazon Web Services) S3 bucket with access policies set to public
Potentially workable places to put files for ingest
Dropbox
We were informed several years ago that a CollectionSpace user was able to ingest media files from Dropbox. We have no details on whether the Dropbox terms of use/service may have changed, or what kind of account this user had.
Places that do not work for ingest
Google Drive
Known to begin blocking/throttling file requests after 15-30 files have been retrieved.
Report your own experiences ingesting to help other users
Have you recently used Dropbox and it still works? Did you try to ingest from SharePoint with no success? Let us know and we can expand or update this list.
File at mediaFileURI must be accessible to the CollectionSpace application
As described above, the CSV Importer merely passes your mediaFileURI values to the CollectionSpace application instance it is connected to.
If, when you try to visit the mediaFileURI in your browser, you are asked to:
log in
deal with login verification or authentication codes
prove you are not a robot
Then, the CollectionSpace application will not be able to access the files. In this situation, the CollectionSpace application IS a robot, and it has no mechanism for storing your login information on third-party sites, or navigating those sites' authentication processes.
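A rough way to approximate what an automated client sees is to request the URI from a script rather than a browser. The sketch below uses only the Python standard library; the function names, the returned messages, and the redirect heuristic are assumptions. A check run from your own computer is not the same as a request from the CollectionSpace server, and some servers reject HEAD requests, so treat the result as a hint only.

```python
import urllib.error
import urllib.request
from urllib.parse import urlsplit


def probe(uri, timeout=10):
    """Send a HEAD request and return (status, final_url). Needs network access."""
    request = urllib.request.Request(uri, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status, response.geturl()
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as HTTPError by urlopen.
        return error.code, error.geturl()


def diagnose_access(status, requested_uri, final_uri):
    """Turn a probe result into a human-readable hint (heuristic only)."""
    if status in (401, 403):
        return "blocked: authentication or permission required"
    if status == 429:
        return "throttled: too many requests"
    if status >= 400:
        return f"failed: HTTP {status}"
    if urlsplit(requested_uri).netloc != urlsplit(final_uri).netloc:
        # Redirects to another host are sometimes login pages, sometimes CDNs.
        return "check manually: redirected to another host"
    return "ok: answered without a login step"


# Example with a made-up result, so no network request is made here:
print(diagnose_access(403, "https://example.org/a.jpg", "https://example.org/a.jpg"))
```

To check a real URI, call status, final_url = probe(uri) and pass both values, along with the original URI, to diagnose_access.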
Note that the second direct URI in the next section (the tile.loc.gov example) cannot be accessed by CollectionSpace even though it points directly to the file, presumably because the server hosting the file is configured to reject system-to-system requests. The server where files are hosted for ingest must allow this kind of request.
mediaFileURI points directly to the media file you wish to ingest
The following URIs may look as though they point to image files:
https://commons.wikimedia.org/wiki/File:Guinea_fowl_keet_2_(cropped).jpg
https://www.digitalcommonwealth.org/search/commonwealth:zk520p665
However, in each case, the link goes to a WEB PAGE presenting a view of the image, with metadata and other contextual information about the image, wrapped in the interface of the overall website/digital collection.
Each of these URIs behaves differently if you ingest it as a media file in CollectionSpace, but none of them ingests the image as you might expect:
No error. CollectionSpace thinks it’s a .jpg since the URI ends with that, so all looks good. But if you try to access the “.jpg” from the Media Handling record, you get an error, since the HTML file that was actually ingested can’t be displayed as a .jpg.
Error. CollectionSpace rejects this URI.
No error. CollectionSpace ingests a “.dms” file with a mime-type of “application/octet-stream”. Accessing the file from the Media Handling record requires you to download the .dms file to your computer. If you open that file locally with a text editor, you will see that it is the HTML source code of the image web page.
The URIs to actually ingest these images would be:
https://upload.wikimedia.org/wikipedia/commons/9/9e/Guinea_fowl_keet_2_(cropped).jpg
https://tile.loc.gov/storage-services/service/pnp/ppmsca/59600/59644v.jpg - This would work, except that it seems this site blocks programmatic access to its images. You can view and download the .jpg manually in your browser, but when the CollectionSpace application requests the file at the URI, it receives a 400 error.
https://differentdomain.net/derivatives/images/imageid/image_access_800.jpg - Fake URI showing the pattern; this source does not want to make direct programmatic access to its images available.
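One way to tell a web page from a direct file link is to look at the Content-Type header the server returns for the URI (visible, for example, in your browser's developer tools). The sketch below classifies such a header value; the function name and the three categories are assumptions, and CollectionSpace itself may decide differently, as the .jpg example above shows.

```python
def classify_content_type(content_type):
    """Classify a Content-Type header value as a web page, media file, or unknown.

    Hypothetical helper: "unknown" covers types such as
    application/octet-stream that need a manual look.
    """
    # Drop parameters such as "; charset=UTF-8" and normalize case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ("text/html", "application/xhtml+xml"):
        return "web page"
    if media_type.split("/", 1)[0] in ("image", "audio", "video"):
        return "media file"
    return "unknown"


for header in ["text/html; charset=UTF-8", "image/jpeg",
               "application/octet-stream"]:
    print(header, "->", classify_content_type(header))
```

A URI that ends in .jpg but returns text/html is a page about the image rather than the image itself.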
Questions and Answers
Can I ingest more than one file on a given Media Handling record?
No. The Media Handling procedure is intended to describe a single media file, so only one file should be attached to a given Media Handling procedure.
Ingesting a second media file on an existing record via the CollectionSpace Services API will cause the link to the first media file to be removed. The CSV Importer does not permit you to do this, since doing it accidentally could be very destructive to your CollectionSpace data.