Recommendations

This page contains recommendations derived from the test runs.



Managed vs. External Datastreams

If digital objects contain datastreams of control group "M" (managed), the source files should be located either on the ingesting machine or on a machine reachable over a very fast, low-latency network connection, in order to minimize retrieval time. The datastreams should be made available through a web server rather than sent to the upload servlet.
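As a rough illustration of the web server approach, the following Python sketch maps a file below a served directory to the URL under which the ingest could reference it. The helper name datastream_url, the directory /data/datastreams, and the host and port are hypothetical and not taken from the test setup; any web server can take the place of the one mentioned in the comment.

```python
from pathlib import PurePosixPath
from urllib.parse import quote


def datastream_url(base_url: str, doc_root: str, file_path: str) -> str:
    """Return the URL under which a web server rooted at doc_root exposes file_path.

    Raises ValueError if file_path is not located below doc_root.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(doc_root))
    # quote() percent-encodes spaces and other unsafe characters but keeps "/".
    return base_url.rstrip("/") + "/" + quote(relative.as_posix())


if __name__ == "__main__":
    # Hypothetical layout: the files live under /data/datastreams and are served
    # locally, for example with
    #   python -m http.server 8000 --directory /data/datastreams
    print(datastream_url("http://localhost:8000",
                         "/data/datastreams",
                         "/data/datastreams/obj1/image 1.tif"))
```

Serving the files from the ingesting machine itself keeps retrieval local, which matches the recommendation to minimize retrieval time.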

Triplestores

Two triplestores were tested during the ingest: MPT and Mulgara (v1.2). Both showed similar performance characteristics during ingest. There are, however, a few points to consider:
  • MPT is not a true semantic store. It consists of multiple database tables that hold subjects and objects, plus a lookup table that stores the predicates. Nevertheless, even a database containing more than 100M triples performed reasonably well in terms of insertion and retrieval of triples.
  • Mulgara is a semantic store built specifically for use in semantic applications. However, as of version 1.2 there was a significant performance penalty when using the sync() call, which persists triples to the underlying physical storage. A single sync() call took up to 500 ms. For the ingest this is not an issue, because sync() does not need to be called often. In a scenario with CRUD operations that depend on each other, however, sync() has to be called more often, which leads to severe performance penalties. This issue has not yet been tested with Mulgara 2.0.
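
To give a feel for why the frequency of sync() calls matters, the following Python sketch estimates the total time spent in sync() for a given number of write operations. It is plain arithmetic and does not call Mulgara; the function name and the batch sizes are illustrative assumptions, and the 0.5 second default is the upper bound observed above, not a typical value.

```python
import math


def sync_overhead_seconds(num_operations: int,
                          operations_per_sync: int,
                          sync_cost_seconds: float = 0.5) -> float:
    """Estimate the total time spent in sync() when syncing once per batch.

    sync_cost_seconds defaults to the 500 ms upper bound reported for
    Mulgara 1.2; the real cost per call may be lower.
    """
    if num_operations < 0 or operations_per_sync < 1:
        raise ValueError("need num_operations >= 0 and operations_per_sync >= 1")
    sync_calls = math.ceil(num_operations / operations_per_sync)
    return sync_calls * sync_cost_seconds


if __name__ == "__main__":
    # Dependent CRUD operations: one sync() after every write.
    print(sync_overhead_seconds(10_000, 1))
    # Bulk ingest: one sync() per 1,000 writes (hypothetical batch size).
    print(sync_overhead_seconds(10_000, 1_000))
```

Under these assumptions, syncing after every one of 10,000 writes costs up to 5,000 seconds in sync() alone, while syncing once per 1,000 writes costs up to 5 seconds.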

For a more comprehensive comparison of MPT and Mulgara see http://www.slideshare.net/cwilper/mptstore-a-fast-scalable-and-stable-resource-index.

Java

Even though the performance of Java 5 and Java 6 looks quite similar in this case, it is still advisable to use Java 6 if possible for the following reasons:
  • The Java 6 VM has better self-tuning capabilities and therefore needs less attention.
  • Java 6 provides more transparency through enhanced management and diagnostic capabilities.

On 64-bit architectures it is generally advisable to use the 64-bit version of Java, although it needs more memory (up to 30%) because of the larger address space. In particular, if the machine running Fedora Commons has more than 2 GB of RAM, the 64-bit version should be used so that the full amount of available memory can be utilized.

Tuning the heap and garbage collector is often unnecessary, at least for Java 6. The heap should, however, be sized appropriately to avoid frequent garbage collections and OutOfMemoryErrors. Minimum and maximum heap size should be set to the same value so that the Java VM does not have to resize the heap.
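
The recommendation to use equal minimum and maximum heap sizes can be sketched as follows. The snippet builds the standard JVM flags -Xms (initial heap) and -Xmx (maximum heap) with the same value; the helper name heap_options and the 2048 MB figure are illustrative assumptions, not values from the test runs. How the flags reach the JVM depends on the deployment; with Tomcat, for instance, they are commonly passed through the JAVA_OPTS environment variable.

```python
from typing import List


def heap_options(heap_mb: int) -> List[str]:
    """Return JVM flags that set initial and maximum heap to the same size."""
    if heap_mb <= 0:
        raise ValueError("heap size must be a positive number of megabytes")
    # -Xms sets the initial heap size, -Xmx the maximum; equal values mean
    # the JVM never has to grow or shrink the heap at run time.
    return [f"-Xms{heap_mb}m", f"-Xmx{heap_mb}m"]


if __name__ == "__main__":
    # 2048 MB is an arbitrary example value.
    print(" ".join(heap_options(2048)))
```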

Database Servers

MySQL and PostgreSQL were tested and found to yield roughly equal performance when tuned properly. The use of MySQL's MyISAM storage engine is discouraged because it lacks critical features necessary for this kind of use case (for example, transaction support).

It is generally advisable to run the database engine on a separate machine, or at least to provide a separate disk for the database (especially for the write-ahead log, WAL). This speeds up database operations and can improve performance by up to 30% in some cases.

Repository Size

We tested the ingest of 14 million digital objects (see here). Ingest times remained stable throughout, and we did not observe any unusual behavior. The ingest took three weeks in total.
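
For orientation, the figures above translate into an average ingest rate that can be worked out as follows. This is a back-of-the-envelope calculation that assumes "three weeks" means 21 full days of continuous ingest; the function name is illustrative.

```python
def average_ingest_rate(num_objects: int, duration_days: float) -> float:
    """Return the average number of objects ingested per second."""
    if duration_days <= 0:
        raise ValueError("duration must be positive")
    seconds = duration_days * 24 * 60 * 60
    return num_objects / seconds


if __name__ == "__main__":
    # 14 million objects over an assumed 21 days of continuous ingest.
    rate = average_ingest_rate(14_000_000, 21)
    print(f"{rate:.1f} objects per second")
```

Under that assumption the average comes out at roughly 7.7 objects per second.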

This page (revision-3) was last changed on 18-Aug-2008 11:43 by KST