/
PAHMA Large Volume Data Loading Notes

PAHMA Large Volume Data Loading Notes

Version 1.6, May 2, 2011

Testing import service. Tried to load 6255 acquisitions records in a single XML file. Load "completed" in 16 minutes, but only 5630 records loaded because of a bug in the importer where records containing dollar signs in their data do not load. The records with dollar signs tend to be the longer records and those records make it part way through the load process, so it is hard to be exact about how many records loaded in 16 minutes. (Loading about 6255 acquisitions previously took 58.5 minutes and 1 hour 7 minutes through the java client api to the services at different times).

Version 1.3, February 11, 2011 -March 29, 2011

Data loading speeds and performance as I have been loading data incrementally are not substantially different with 1.3 than with 1.0.

March 7-8, 2011

Today, with about 6,500 acquisitions, 260,000 collection objects, and probably near 500,000 relations (as well as about 7,000 names and 1,500 organizations), I noticed that the cspace server now takes longer to restart than nuxeo: 12-plus minutes on the pahma-demo slice. The log on the pahma slice running version 1.0 shows the last cspace startup taking 37 seconds. Has something changed, or is there a problem with this slice?

The problem turned out to be (post-)init handlers that modify the data type on 5 fields in collection object (e.g., description and comments, which I had had to truncate) to change them to largetext. Mysql was doing the alter table even if it had already been done, and it was becoming very slow with lots of data. Aron filed a jira and the problem has been fixed for future releases. I've commented out the init handler section in the tenant bindings file (since the modifications have already been done), and the cspace server now restarts in 30-40 seconds.

March 10, 2011

There are now 300,000 objects in CSpace. Startup is extremely slow. It takes over 6 minutes before the acquisitions show up in MyCollectionSpace (where there is no longer a spinner, so it is not clear that anything is going to show up at all). The objects have not shown up after 30 minutes and may never (though they are searchable). The following show up in the slow query log during that time (the collection objects still have not arrived):

# Query_time: 198  Lock_time: 0  Rows_sent: 6247  Rows_examined: 12494
SELECT `hierarchy`.`id` AS `_C1` FROM `hierarchy` LEFT JOIN `collectionspace_core` ON `hierarchy`.`id` = `collectionspace_core`.`id` WHERE `hierarchy`.`primarytype` IN ('Acquisition') AND (`collectionspace_core`.`tenantid` = 1);
# Time: 110310 10:42:48
# User@Host: jboss[jboss] @ localhost [127.0.0.1]
# Query_time: 395  Lock_time: 0  Rows_sent: 299995  Rows_examined: 599990
SELECT `hierarchy`.`id` AS `_C1` FROM `hierarchy` LEFT JOIN `collectionspace_core` ON `hierarchy`.`id` = `collectionspace_core`.`id` WHERE `hierarchy`.`primarytype` IN ('CollectionObject') AND (`collectionspace_core`.`tenantid` = 1)

There are single-field indexes on id in both tables, hierarchy and collectionspace_core, and on primarytype in hierarchy; no index on tenantid. There are 2012253 rows in hierarchy and 830394 in collectionspace_core. This is on a slice with total 2G memory and no particular memory configuration for mysql.

March 11, 2011

With 300,000 collection objects in, I've loaded loan in and loan out records plus their relations with the collection objects that have been loaded (no relations for this set for loans in). There are 1896 loans out and 15 loans in plus about 1900 new relations.

March 16, 2011

Followup on indexes and speed for populating Find and Edit. I added the following indexes:

create index id_tid_ndx on collectionspace_core(id, tenantid)

create index id_pt_ndx on hierarchy(id, primarytype)

The first took about two minutes to create. After it was added, the find/edit page returns in about 4.5 minutes including, cataloging, acquisitions, loans in, and loans out. Three queries still show up in the slow query log (minus loans in, which has only a few records).

The second index took 13 hours to create(!) After it was added (that is, with both of the indexes present), the find/edit page returns in about 1 minute. Two queries still appear in the slow query log (primarytype Acquisition and primarytype CollectionObject), but query time is down to 23 seconds on both from 61 and 261 seconds previously (with first index only).

March 18, 2011

Changed the mysql configuration to "mysql-large.cnf" for large but not humongous databases (the huge conf is recommended for systems with lots of memory running mysql and nothing else). Now in v1.3 we have the added indexes plus the more optimized configuration. (The optimized configuration improved search times on the pahma slice running 1.0 in most cases. Search, which was already fast, did not improve. Loading a record with many relations improved somewhat but is still rather slow.)

Here are some new figures for v1.3 with more 300,000 cataloging records, etc., and optimized configuration:

Startup and populate MyCollectionSpace: 1 minute (same as before on 1.3 with added indexes)

open a cataloging record (no related acquisition): 14 seconds; was 1 minute 35 seconds on v1.0 before optimization with 200,000 recs

search for blouse: about 9 seconds? (160 records returned); was about 5 seconds in v1.0 with 200,000 recs (157 obj. returned)

search for blouse again later: about 4 seconds; not sure first count was accurate

search for mola: 5 seconds; was 3 seconds in v1.0 with 200,00 recs

open 3-30176: 5 seconds; was 45 seconds in v1.0 with 200,000 recs

open Acc.4734: 1 minute; was 1 minute 15 seconds in v1.0 with 200,000 recs (this acquisition record has over 40 related cataloging records and this is a known slow spot)

search for ear finger rings: about 3 seconds (1 record returned); was 2-3 seconds in v1.0 with 200,000 recs

open obj 6-22211: 11 seconds; was 45 seconds in v1.0 with 200,000 records

Searches may be a tad slower with additional records. Everything else is faster.

When I added the two indexes (see March 16 entry) to the slice running 1.0 with 200,000 cataloging records, etc., and a total of 1.2 million records in hierarchy rather than 2.0 million, the indexes took 31 seconds and 8 minutes, respectively, rather than 2 minutes and 13 hours to create (the indexes aren't really needed in that version but I wanted to create them to see how long it took).

March 29, 2011

I'm loading another 100,000 collection objects and their relations. New database configuration has not improved data loading times or improved problems with dropped connections (or whatever it is). For example, 41,000 collection objects with 20,000 relations took about 23 hours to load. The load may just stall after a certain number of records (generally somewhere between 50,000 and 100,000?). I believe it is important to restart nuxeo and cspace servers before starting again. Sometimes during a batch load there is a message like this but the retry apparently succeeds and load continues:

70637248 [main] INFO org.apache.commons.httpclient.HttpMethodDirector - I/O exception (org.apache.commons.httpclient.NoHttpResponseException) caught when processing request: The server 173.45.227.143 failed to respond
70637248 [main] INFO org.apache.commons.httpclient.HttpMethodDirector - Retrying request

April 1-4, 2011

With 200,000 more collection objects (total 500,000) and their acquisition relations added (not so many in this batch have acquisition relations though), starting up and populating MyCollectionSpace has gotten slower than before. It takes 6+ minutes to populate acquisitions, loans in, and loans out. Cataloging records never display in the UI (after 15+ minutes).

From slow query log:

# Time: 110401 17:56:56
# User@Host: jboss[jboss] @ localhost [127.0.0.1]
# Query_time: 53  Lock_time: 0  Rows_sent: 1895  Rows_examined: 3790
SELECT `hierarchy`.`id` AS `_C1` FROM `hierarchy` LEFT JOIN `collectionspace_core` ON `hierarchy`.`id` = `collectionspace_core`.`id` WHERE `hierarchy`.`primarytype` IN ('Loanout') AND (`collectionspace_core`.`tenantid` = 1);
# Time: 110401 17:58:56
# User@Host: jboss[jboss] @ localhost [127.0.0.1]
# Query_time: 117  Lock_time: 0  Rows_sent: 6247  Rows_examined: 12494
SELECT `hierarchy`.`id` AS `_C1` FROM `hierarchy` LEFT JOIN `collectionspace_core` ON `hierarchy`.`id` = `collectionspace_core`.`id` WHERE `hierarchy`.`primarytype` IN ('Acquisition') AND (`collectionspace_core`.`tenantid` = 1);
# Time: 110401 18:02:12
# User@Host: jboss[jboss] @ localhost [127.0.0.1]
# Query_time: 374  Lock_time: 0  Rows_sent: 499993  Rows_examined: 999986
SELECT `hierarchy`.`id` AS `_C1` FROM `hierarchy` LEFT JOIN `collectionspace_core` ON `hierarchy`.`id` = `collectionspace_core`.`id` WHERE `hierarchy`.`primarytype` IN ('CollectionObject') AND (`collectionspace_core`.`tenantid` = 1);   (**)

Free memory is 15876 k when running the last query above (71160 k when query completes); MySql is using 19% of memory when query (**) is running and is "top" process in linux top command.

Counts:

hierarchy: 3088206 rows (takes 304 seconds to return the  first time; subsequently takes 0 seconds to return indicating caching)

collectionspace_core: 1164250 rows (takes 16.44 seconds)

All queries seem to run much faster (from mysql prompt) if the Nuxeo and CSpace servers are stopped, indicating that system resources are an issue. When the other servers are stopped, mysql uses the same percentage of memory but CPU use goes up.

Furthermore, these queries can be faster or slower at different times of day depending on how much CPU is available to mysql (the ** query took about 5 and a half minutes in the morning, at which top showed it to be using 2-9 percent of CPU and 16% when the rows finally came back, whereas it is taking over 15 minutes this afternoon and top shows it to be using .3 % of CPU).

Queries also run faster if repeated soon after they've been run, indicating that caching is sometimes taking place. This makes it hard to make various comparisons.

I believe that replacing "in" in the queries above with an equals comparison does not make a difference but caching makes it hard to test adequately.

Opening various records is still comparable to before:

Open acquisition with 1 related object: 5 seconds

Open acquisition 4734 with > 40 related objects: 1 minute

Open cataloging record 6-22211 with 1 related acquisition: 10 seconds

Searches are still comparable to March 18 times:

blouse: 5 seconds (returns 204 records)

ear finger ring: about 3 seconds

Tentative conclusion: Startup is problematic: the queries don't return in time to populate the MyCollectionSpace page. Total 2G memory configuration on slicehost (and the amount of CPU available) is not enough to run Nuxeo and CSpace servers and MySQL with large configuration. Large configuration is probably still adequate with more memory and/or CPU cycles. It is likely database should be run on a dedicated server.

Comparison of Slicehost with a VPS in the Data Center, and with a stand-alone server

After copying the pahma-demo database and jboss instances to a different (non-slicehost) server in the
UC data-center, and to a stand-alone server -- a Dell 2650  -- also in the data-center, these tests were run,

with    499,993 collectionobjects_common records
      1,167,090 collectionspace_core records
      3,091,046 hierarchy records

after mysqld restart:                                        pahma-demo        bgcspace               Dell 2650
select count(*) from collectionobjects_common:             1-min 57-sec      1-min 19-sec               31-sec

immediately repeat:                                               0-sec            0-sec                 0-sec

select count(*) from hierarchy:                            4-min 15-sec      2-min 11-sec         1-min 19-sec

immediately repeat:                                               0-sec             0-sec                0-sec

select count(*) from collectionspace_core;                       26-sec            53-sec               23-sec

immedately repeat:                                                0-sec             0-sec                0-sec

SELECT `hierarchy`.`id` AS `_C1`
FROM `hierarchy` LEFT JOIN `collectionspace_core`
ON `hierarchy`.`id` = `collectionspace_core`.`id`
WHERE `hierarchy`.`primarytype`
IN ('CollectionObject')
AND (`collectionspace_core`.`tenantid` = 1)                5-min 24-sec      3-min 57-sec         2-min 57-sec

immediately repeat:                                        5-min 26-sec      3-min 21-sec               27-sec
immediately repeat, sending output to file:                5-min 40-sec      4-min 48-sec               29-sec

after login, My Collectionspace is populated in:           7-min 12-sec      6-min 35-sec         5-min 55-sec
but with no Cataloging Records listed (they never show up).


I also ran some of these tests on a new Dell Optiplex with 8GB of RAM - all of the above machines had 2GB of
RAM The complex SELECT took 1-min 12-sec, and when repeated it took 9-sec.  This is, I think, a direct result
of the increased memory.

select count(*) from hierarchy;                17-sec
select count(*) from collectionspace_core;      4-sec
select count(*) from collectionobjects_common;  2-sec

Interestingly, the time to paint the landing page (My Collectionspace) was 6-min 45-sec

Version 1.0, January 26 to February 3, 2011

1/26-28/11: The first 100,000 object records went very slowly. It took something over an hour to load each 1,000 object records with their relations to acquisitions (two ways, two relations). I don't have a total time because it was interrupted by some control characters in the data (cleaned up for the first 100,000 records) and the fact that some connections were not being released in the java (fixed). Also, it seems that the speed degrades over time. I hope to be able to come up better figures for the next 100,000. After the initial object records are loading, I have been loading four updates for fields coming from different parts of the old system or going into repeating groups in the object data, such as inscription (3 records in this batch), object collection information (64,822), object production information (5456), and references (2046). Each of these loads takes a long time. I am considering merging these into the main object data before loading in the future.

I have determined that at least for the updates, performance degrades over time. For example, in an update that begin at 6:00 in the evening http puts were running between 340-700 ms based on log data; by morning they mostly ranged from 700-1400. In one case, performance had gotten so bad that updates were taking one and a half and later two minutes all told; I don't have log figures for the puts on these. I discovered that restarting cspace, nuxeo, and mysql brings the speed back to normal again. Restarting just cspace does not fix the problem; restarting cspace and nuxeo seems to fix the problem (it is not clear whether restarting mysql would make an additional difference).

I tried adding an index on the csid in nuxeo's hierarchy table to see if this sped up the updates, but it made very little difference (so I removed it). At that point just adding and removing the index took a long time.

By the end, I removed the gets from the updates that checked whether the record exists before updating (since I had csids for them, they were all likely to exist). That sped up the process slightly, but the performance still degrades over time.

To come: memory figures at various points during load from jmx monitor.

(Other notes: 6973 person authority records loaded in 37 minutes; 1578 organization authority records loaded in 8 minutes)

Comparison of search and report generation with 10,000 and 100,000 objects coming soon. [SS]

1/31/11: Performance with 100,000 compared to 10,000 object records (in both cases 6973 person authority records, 1578 organization records, and 6248 acquisition records have been loaded--objects have the same number of acquisition records attached in both cases but acquisitions may have more objects) Note: no data loading is taking place in the background during these tests. Nuxeo was recently restarted in 100,000 record case but probably not in 10,000 record case (since I didn't know this could be an issue). Some figures were not collected for 10,000 objects

Summary: speed of searches is pretty much the same for 10,000 and 100,000 objects; they may take a bit longer if more objects are returned; startup appears to be slower with 100,000. Opening a record is more affected by the number of related records than by total number of records in system. Caching may be a factor (slowing down time to open first records and speeding up later)

Startup

Time from login until My CollectionSpace loaded: about 40-45 seconds with 100,000 (was about 10-14 seconds with 10,000?)

Open a Cataloging Record from Recently Added (first page): 20 seconds with 100,000 and no related acquisition (don't have figure for 10,000); later 15 seconds to an object with one related acquisition

Subsequent loads of My CollectionSpace: about 20 seconds with 100,000 objects

Searches

Search Cataloging records for blouse: about 4 seconds with 100,000 objects (returns 153 items); 3 seconds with 10,000 (returns 37 items)

Search Cataloging records for mola: about 1 second with 100,000 objects (returns 37 items); no figure for 10,000 with 23 items

Open 3-30176 from previous search: about 10 seconds with 100,000; 4 seconds with 10,000

Load Related acquisition 4734 with 40 related objects: 35-40 seconds with 100,000; 35 seconds with 10,000

Search Acquisitions for Gambia: 2 seconds with 100,000 items; same with 10,000 (2 acquisitions returned)

Open acquisition 4734 (same as above): 25 seconds with 100,000; was 35 seconds with 10,000

Open acquisition 2747 with no related objects: 5 seconds with 100,000; same with 10,000

Search Cataloging records for ear finger rings: 2-3 seconds with 100,000 (no figure for 10,000)

[SS]

2/1-3/11: Loading the next 100,000 objects. After the first 40,000, with 140,000 total objects in and after restarting the servers (actually the whole slice), we've started monitoring memory usage with jmx tool.  Will take a snapshot of memory graphs every few hours during the load, while recording the number of objects added since last server restart (note that relations are being added as well, but they won't be in this simple count). Then will monitor updates to objects, where the performance degradation was most obvious. I've removed a read after the load to speed up loading a bit, but I am still outputting debug information to the console. The program loaded 2,385 objects during the first hour, which is a great improvement over the first batch. Could possibly be faster without so much console output.

52,804 objects and their relations to acquisitions (100,000+) were loaded successfully between Feb. 1 and 5:07 p.m. and Feb. 2 at 10:21 (29+ hours). A final collection object was loaded that appears to have taken 73020 ms. (there was nothing special about this object record) and the program terminated with a null ID returned while trying to create its relations. The previous posts all took 400-700 ms., which is usual. I deleted the final collection object, restarted the servers, and resumed the import starting with this collection object: on restart the first post took 8169 ms (previous logs also show a longer time for the first one) and subsequent posts were back in the 400-500 range for collection objects. The final 3807 objects and 7612 relations loaded in 1 hour 51 minutes.

I hope we have memory monitor information for the period in question (yes, but nothing conclusive in it).

Searches, etc. with 200,000 objects: time until MyCollectionSpace opened: 2 minutes 20 seconds; open a cataloging record (no related acquisition): 1 minute 35 seconds; search for blouse: about 5 seconds (157 obj. returned); search for mola, 3 seconds; open 3-30176, 45 seconds; open Acc.4734: 1 minute 15 seconds; Acc.2747 now has 40 related objects: 1 minute 20 seconds; search for ear finger rings: still 2-3 seconds; open obj 6-22211: 45 seconds. Searches unaffected by greater number of records; opening objects and acquisitions is slower.

The MySQL slow query log shows the following query repeatedly with different ids (documenttype1 and documenttype2 are probably unindexed): SELECT `hierarchy`.`id` AS `_C1` FROM `hierarchy` LEFT JOIN `collectionspace_core` ON `hierarchy`.`id` = `collectionspace_core`.`id` LEFT JOIN `relations_common` ON `hierarchy`.`id` = `relations_common`.`id` WHERE `hierarchy`.`primarytype` IN ('Relation', 'Tagging') AND (`collectionspace_core`.`tenantid` = 1) AND (`relations_common`.`documentid1` = 'cfa503a1-7812-46d3-98fe') AND (`relations_common`.`documenttype2` = 'collectionobjects');

Added index on documentid1, documenttype2 to relations_common: open 3-30176 now takes about 5 seconds (rather than 45); open acc.4734 now takes 35 seconds rather than 1 minute 15 seconds.

Reports: acquisition-object report for %Hearst% returning 1,642 pages now takes 4.5 minutes; same report for %McKay% returning 1 page (4 lines): 1 minute 10 seconds (search is on refname); acquisition method report (obj-acq), 1 minute 35 seconds. I don't have figures for 10,000 objects, but I believe the last two would have been well under 30 seconds. The queries in these reports show up in mysql's slow query log (and they take way longer than the one above).