Reindexing full text

This documentation was copied from a location where it (a) was effectively unfindable; (b) implied that it was relevant only to a single custom deployment of CollectionSpace; and (c) had not been updated since 2014. It was moved here because the main part of it appears to describe the current use and behavior of the Reindex Full Text batch job in the base application.

Overview

It is sometimes convenient, though highly discouraged, to update CollectionSpace data directly in the database, using SQL, rather than going through the CollectionSpace UI or REST API. After making changes this way, the Nuxeo full text index will usually need to be updated. Otherwise, the new values will not be visible to full text (keyword) search. The "Reindex Full Text" batch job may be used to accomplish the necessary reindexing.

Full Text Indexing

Full text search in Nuxeo is implemented through the fulltext database table. When a document is saved, Nuxeo business logic (implemented in Java) extracts the text in the document by normalizing and concatenating all of its field values. This normalized and concatenated text is stored in the simpletext column of the fulltext table. Nuxeo also extracts text from binary attachments (e.g. pdf or image files) where possible, and stores that text in the binarytext column of the fulltext table. In the database, a trigger nx_trig_ft_update updates the fulltext column of the fulltext table when simpletext or binarytext are updated. The fulltext column maintains a concatenation of simpletext and binarytext. Finally, a database index (fulltext_fulltext_idx) exists on the fulltext column. This index is used by full text queries.

When changes are made directly to the database, Nuxeo is unaware of them, and is not able to execute its business logic to recompute the simpletext and binarytext of the affected documents. The indexed text becomes out of sync with the actual content of the document.

The Reindex Full Text Batch Job

The "Reindex Full Text" batch job may be installed on UCB deployments. This batch job forces Nuxeo to recompute the simpletext and binarytext of specified documents.

Implementation

The batch job is adapted from the nuxeo-reindex-fulltext Nuxeo plugin. There is no public API in Nuxeo that just recomputes the full text for a document. The only way to trigger that logic is to save the document. So, for each document to be reindexed, the batch job makes a temporary change to an arbitrary field (dc:title), saves the document, changes the field back to its original value, and saves the document again. This is done using a low-level Nuxeo session, so that the last modified date is not changed and Nuxeo event handlers do not fire.

Installing the Batch Job

Use the Services REST API to install the batch job in a tenant, by posting XML to /cspace-services/batch. An example curl command follows, where the XML is in a file called install-payload.xml. Replace <username>, <password>, and <hostname> with appropriate values.

curl -X POST -i -u "<username>:<password>" https://<hostname>/cspace-services/batch -T install-payload.xml

An example XML payload is shown below. The <forDocTypes> section is not needed for no-context mode invocations, but it is necessary to run the batch job in single or list modes (see below for descriptions of each mode). Any doctypes that may need to be reindexed in single or list modes must be registered in the batch job's <forDocTypes>. Any doctype specified in <forDocTypes> will also have "Reindex Full Text" appear in the "Run Batch Process" dropdown of its record editor, if it is an editable record.

The XML shown registers the Reindex Full Text batch job to run on all doctypes known to the PAHMA 3.3 tenant. Doctypes in other tenants and versions may vary. The appropriate doctypes may be found by logging into the deployment's server. In the tomcat installation directory, there is a directory cspace/config/services/tenants/<tenantname>, which contains the file tenant-bindings.merged.xml. In that file, each doctype is represented by a service:object tag, and the doctype name is specified in the name attribute of that tag.

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<document name="batch">
  <ns2:batch_common xmlns:ns2="http://collectionspace.org/services/batch" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <name>Reindex Full Text</name>
    <notes>Recomputes the indexed full text of all or specified records.</notes>
    <className>org.collectionspace.services.batch.nuxeo.ReindexFullTextBatchJob</className>
    <supportsNoContext>true</supportsNoContext>
    <supportsSingleDoc>true</supportsSingleDoc>
    <supportsDocList>true</supportsDocList>
    <supportsGroup>false</supportsGroup>
    <createsNewFocus>false</createsNewFocus>
    <forDocTypes>
      <forDocType>Acquisition</forDocType>
      <forDocType>Batch</forDocType>
      <forDocType>CSNote</forDocType>
      <forDocType>Citation</forDocType>
      <forDocType>Citationauthority</forDocType>
      <forDocType>Claim</forDocType>
      <forDocType>Conceptauthority</forDocType>
      <forDocType>Conceptitem</forDocType>
      <forDocType>Contact</forDocType>
      <forDocType>Dimension</forDocType>
      <forDocType>Group</forDocType>
      <forDocType>Intake</forDocType>
      <forDocType>Loanin</forDocType>
      <forDocType>Loanout</forDocType>
      <forDocType>Locationauthority</forDocType>
      <forDocType>Locationitem</forDocType>
      <forDocType>ObjectExit</forDocType>
      <forDocType>Organization</forDocType>
      <forDocType>Orgauthority</forDocType>
      <forDocType>Person</forDocType>
      <forDocType>Personauthority</forDocType>
      <forDocType>Placeauthority</forDocType>
      <forDocType>Placeitem</forDocType>
      <forDocType>PublicItem</forDocType>
      <forDocType>Report</forDocType>
      <forDocType>Taxon</forDocType>
      <forDocType>Taxonomyauthority</forDocType>
      <forDocType>Vocabulary</forDocType>
      <forDocType>Vocabularyitem</forDocType>
      <forDocType>CollectionObject</forDocType>
      <forDocType>Blob</forDocType>
      <forDocType>Media</forDocType>
      <forDocType>Movement</forDocType>
      <forDocType>Relation</forDocType>
    </forDocTypes>
  </ns2:batch_common>
</document>

Running the Batch Job

The Reindex Full Text batch job may be invoked through the UI (to reindex a single document, when the doctype has been registered in the batch job's forDocTypes list). It will more commonly be invoked through the Services REST API, by posting XML to the appropriate batch job. To find the csid of the Reindex Full Text batch job, use the REST API to list all batch jobs, by getting the URL /cspace-services/batch. Look for the result whose <name> is "Reindex Full Text", and find the corresponding <csid>. If there is no result with that name, the batch job has not been installed. Install the batch job before continuing.

An example curl command to execute the batch job follows, where the XML payload is in a file called reindex_payload.xml. Replace <username>, <password>, and <hostname> with appropriate values. Replace <csid> with the csid of the Reindex Full Text batch job.

curl -X POST -i -u "<username>:<password>" https://<hostname>/cspace-services/batch/<csid> -T reindex_payload.xml

The content of the XML payload will vary depending on the type of invocation.

Reindexing a Single Document (Single Mode Invocation)

There are two ways to reindex a single document. In either case, the document's doctype must appear in the forDocTypes list of the batch job. To check this, get the URL /cspace-services/batch/<csid>, where <csid> is the csid of the Reindex Full Text batch job. In the XML response, locate the forDocTypes section, and verify that the desired doctype appears. If it does not, update the batch job, adding the desired doctype before continuing.

To reindex a single document through the CollectionSpace UI, open the document. In the Run Batch Process dropdown, select "Reindex Full Text". Click Run. The document will be reindexed.

To reindex a single document through the REST API, post XML to the batch job, containing the csid of the document, and its doctype. An example payload follows:
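
A minimal sketch of a single mode payload, assuming the standard CollectionSpace invocation context format. The invocationContext element, its namespace, and the singleCSID element are not given on this page and should be checked against the deployed services version. The doctype and csid values are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed format: CollectionSpace invocation context (not confirmed by this page). -->
<ns2:invocationContext xmlns:ns2="http://collectionspace.org/services/common/invocable">
  <mode>single</mode>
  <!-- Must be registered in the batch job's forDocTypes list. -->
  <docType>CollectionObject</docType>
  <!-- Placeholder: replace with the csid of the document to reindex. -->
  <singleCSID>REPLACE-WITH-DOCUMENT-CSID</singleCSID>
</ns2:invocationContext>
```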

Reindexing a List of Documents (List Mode Invocation)

A list of documents may be reindexed in a single batch invocation, as long as they all have the same doctype. That doctype must appear in the forDocTypes list of the batch job. To check this, get the URL /cspace-services/batch/<csid>, where <csid> is the csid of the Reindex Full Text batch job. In the XML response, locate the forDocTypes section, and verify that the desired doctype appears. If it does not, update the batch job, adding the desired doctype before continuing.

To reindex a list of documents through the REST API, post XML to the batch job, containing a list of document csids, and the doctype of the documents. All of the documents will be reindexed in a single database transaction. An example payload follows:
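
A minimal sketch of a list mode payload, assuming the standard CollectionSpace invocation context format. The invocationContext, listCSIDs, and csid element names are assumptions not confirmed by this page. The csid values are placeholders.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed format: CollectionSpace invocation context (not confirmed by this page). -->
<ns2:invocationContext xmlns:ns2="http://collectionspace.org/services/common/invocable">
  <mode>list</mode>
  <!-- All listed documents must share this doctype, which must be in forDocTypes. -->
  <docType>CollectionObject</docType>
  <listCSIDs>
    <!-- Placeholders: one csid element per document to reindex. -->
    <csid>REPLACE-WITH-FIRST-CSID</csid>
    <csid>REPLACE-WITH-SECOND-CSID</csid>
  </listCSIDs>
</ns2:invocationContext>
```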

Be careful not to attempt to reindex too many documents in one batch invocation. An excessive number of csids may cause the batch job to generate an NXQL query that exceeds the allowable length. Attempting to reindex too many csids in one transaction may also exceed Nuxeo's transaction time limit, which will cause the entire job to be cancelled and rolled back. About 1,000 csids is a safe number to reindex at one time.

This script may be used to break up a list of csids into multiple list mode batch invocations: Tools/scripts/reindex_csids.pl at master · cspace-deployment/Tools

Reindexing Document Types, or All Documents (No-Context Mode Invocation)

All documents of a certain doctype may be reindexed in a single batch invocation. To do this, post XML to the batch job, containing the doctype to be reindexed. An example payload follows:
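
A minimal sketch of a no-context payload for one doctype, assuming the standard CollectionSpace invocation context format. The element names and the mode value nocontext are assumptions to verify against the deployed services version.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed format: CollectionSpace invocation context (not confirmed by this page). -->
<ns2:invocationContext xmlns:ns2="http://collectionspace.org/services/common/invocable">
  <mode>nocontext</mode>
  <!-- Every document of this doctype is reindexed. -->
  <docType>CollectionObject</docType>
</ns2:invocationContext>
```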

The doctype specified in the <docType> tag must appear in the forDocTypes list of the batch job. To check this, get the URL /cspace-services/batch/<csid>, where <csid> is the csid of the Reindex Full Text batch job. In the XML response, locate the forDocTypes section, and verify that the desired doctype appears. If it does not, update the batch job, adding the desired doctype before attempting to invoke the batch job. Alternatively, specify the doctype as a parameter, as described below.

Additional doctypes to be reindexed may be specified as parameters. An example payload follows:
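
A sketch of a no-context payload with additional doctypes, assuming the standard invocation context format and assuming that the parameter key is docType. Neither is confirmed by this page. The doctypes chosen are illustrative.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed format: CollectionSpace invocation context (not confirmed by this page). -->
<ns2:invocationContext xmlns:ns2="http://collectionspace.org/services/common/invocable">
  <mode>nocontext</mode>
  <!-- Top-level doctype: reindexed first. -->
  <docType>CollectionObject</docType>
  <params>
    <!-- Assumed parameter key "docType": one param element per additional doctype. -->
    <param>
      <key>docType</key>
      <value>Media</value>
    </param>
    <param>
      <key>docType</key>
      <value>Movement</value>
    </param>
  </params>
</ns2:invocationContext>
```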

If multiple doctypes are specified, the one specified in the top-level <docType> tag is reindexed first, followed by the doctypes specified in <params>, in the order in which they appear. In no-context mode, the top-level <docType> tag may be omitted, and all doctypes may be specified as parameters. Doctypes specified as parameters need not appear in the forDocTypes list of the batch job. All of those doctypes will be reindexed, even if they were not registered at installation time.

In no-context mode, if no doctypes are specified, all known doctypes will be reindexed. The following XML payload reindexes all documents in the system, one doctype at a time:
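
A minimal sketch of such a payload, assuming the standard CollectionSpace invocation context format (element names and namespace are not confirmed by this page). No docType element and no doctype parameters are present.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed format: CollectionSpace invocation context (not confirmed by this page). -->
<ns2:invocationContext xmlns:ns2="http://collectionspace.org/services/common/invocable">
  <!-- No doctype is specified, so all known doctypes are reindexed. -->
  <mode>nocontext</mode>
</ns2:invocationContext>
```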

Some doctypes may contain millions of documents, so no-context mode should be used carefully. For each doctype to be reindexed, the batch job finds and iterates over documents with that doctype, and reindexes them in batches. Each batch is reindexed in a separate database transaction. The default batch size is 1,000. Some additional parameters may be specified to control this behavior:

Parameter Name: batchSize
Default: 1000
Description: The number of documents to reindex in a single database transaction. Each doctype to be reindexed is broken up into batches of this size.

Parameter Name: batchPause
Default: 0
Description: The number of milliseconds to pause before reindexing a batch of documents. This is useful to reduce system resources used by the batch job. Setting a small batchSize and a large batchPause provides a lot of time for other system activities. In practice, the defaults seem fine, even on PAHMA.

Parameter Name: startBatch
Default: 1
Description: The batch number at which to start reindexing. This is useful for continuing a previous reindexing that was stopped before all batches were completed. This parameter is only meant to be used when a single doctype is being reindexed, because batch numbering restarts at 1 for each doctype. For example, specifying a startBatch of 5 when multiple doctypes are to be reindexed means that for each doctype, reindexing will start at batch 5.

Parameter Name: endBatch
Default: 0
Description: The batch number after which to stop reindexing. If 0, stop after all documents in the doctype have been reindexed. This parameter is only meant to be used when a single doctype is being reindexed, because batch numbering restarts at 1 for each doctype. For example, specifying an endBatch of 10 when multiple doctypes are to be reindexed means that for each doctype, reindexing will end at batch 10.

The following example reindexes CollectionObjects, from batch number 10 through batch number 20, with a batch size of 1000, and a pause of 100 ms between batches:
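
A sketch of that payload, assuming the standard CollectionSpace invocation context format and assuming that batchSize, batchPause, startBatch, and endBatch are passed as param key/value pairs. The parameter names come from the list above. The surrounding element names are assumptions.

```xml
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<!-- Assumed format: CollectionSpace invocation context (not confirmed by this page). -->
<ns2:invocationContext xmlns:ns2="http://collectionspace.org/services/common/invocable">
  <mode>nocontext</mode>
  <docType>CollectionObject</docType>
  <params>
    <!-- Documents per database transaction. -->
    <param>
      <key>batchSize</key>
      <value>1000</value>
    </param>
    <!-- Pause before each batch, in milliseconds. -->
    <param>
      <key>batchPause</key>
      <value>100</value>
    </param>
    <!-- First batch to reindex. -->
    <param>
      <key>startBatch</key>
      <value>10</value>
    </param>
    <!-- Last batch to reindex. -->
    <param>
      <key>endBatch</key>
      <value>20</value>
    </param>
  </params>
</ns2:invocationContext>
```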

Stopping the Batch Job

A Reindex Full Text job running in no-context mode may be stopped by creating a specially named stop file. Stop files must be placed in the directory <tomcat>/temp/org.collectionspace.services.batch.nuxeo.ReindexFullTextBatchJob, where <tomcat> is the tomcat installation directory. To stop the job after the current batch of documents has been reindexed, create a file called stopBatch in the directory. To stop the job after all documents in the current doctype have been reindexed, create a file called stopDocType. Be sure to remove the stop file before attempting to run the batch job again, or it will stop immediately.

Stopping tomcat will also terminate a running Reindex Full Text job.

Updating the Batch Job

It may be necessary to update the batch job after installation, to add registered doctypes, or change the name or notes. To do this, use the REST API to PUT an XML payload to the batch job. An example curl command follows, where the XML is in a file called update-payload.xml. Replace <username>, <password>, and <hostname> with appropriate values. Replace <csid> with the csid of the Reindex Full Text batch job.

curl -X PUT -i -u "<username>:<password>" https://<hostname>/cspace-services/batch/<csid> -T update-payload.xml

An example XML payload is shown below, which changes the <forDocTypes> of the batch job:
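
A sketch of an update payload, modeled on the install payload earlier on this page. This page does not state whether fields omitted from a PUT are preserved, so the sketch repeats the install fields. The doctypes listed are illustrative only.

```xml
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<document name="batch">
  <ns2:batch_common xmlns:ns2="http://collectionspace.org/services/batch" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
    <name>Reindex Full Text</name>
    <notes>Recomputes the indexed full text of all or specified records.</notes>
    <className>org.collectionspace.services.batch.nuxeo.ReindexFullTextBatchJob</className>
    <supportsNoContext>true</supportsNoContext>
    <supportsSingleDoc>true</supportsSingleDoc>
    <supportsDocList>true</supportsDocList>
    <supportsGroup>false</supportsGroup>
    <createsNewFocus>false</createsNewFocus>
    <!-- Illustrative list: include every doctype that should be registered after the update. -->
    <forDocTypes>
      <forDocType>CollectionObject</forDocType>
      <forDocType>Media</forDocType>
      <forDocType>Movement</forDocType>
    </forDocTypes>
  </ns2:batch_common>
</document>
```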

Troubleshooting

Debugging and progress information are logged in the services log (cspace-services.log). If the Reindex Full Text batch job is stopped because of an exception, that log will contain the exception information, and will show the doctype and batch number being reindexed (if in no-context mode). Using this information, it's possible to resume the batch job from the last batch that was successfully reindexed.

The most common cause of unexpected stoppage of the job is a transaction timeout that occurs when the system is busy. If this happens, the job can be re-executed later (for single and list mode invocations), or resumed at the last successful batch (for no-context invocations). If a list mode invocation times out, it may be necessary to reduce the number of csids in the invocation, and break it up into multiple invocations.

When running the job in no-context mode, or in list mode with many csids, the POST may finish after a few minutes with a 500 error. This is most likely just apache timing out the request, while the batch job continues to run in tomcat. Use the services log to track progress.