Introduction

This page reports measurements taken while ingesting very large data sets. We have a large patents database containing approximately 1.4 million patents. To show that Fedora Commons can handle data sets of this size, and larger, on standard retail hardware, we ingested the complete patents database. We repeated the process up to 10 times, so that the repository eventually contained nearly 14 million digital objects.

Full Import Of Patents Data #1

Setup

  • Fedora 3.0b1, default installation
  • MPT Triplestore, local
  • Java 1.6
  • Managed content datastreams
  • Database: Postgres, local
  • Ingest from local machine, only one feeding instance

Statistics:

#            total (ms)     avg (ms)   min (ms)   max (ms)
1,411,242    275,695,391    195        47         17,697

Total duration of ingest: approx. 3.5 days
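
The summary figures for ingest #1 can be cross-checked with a little arithmetic. The sketch below is illustrative only: the constant and function names are made up for this example, and the inputs are the object count and total time from the statistics row above. It recomputes the average per object and converts the summed ingest time into days.

```python
# Figures copied from the statistics row for ingest #1.
OBJECTS = 1_411_242        # number of ingested objects
TOTAL_MS = 275_695_391     # summed ingest time in milliseconds

MS_PER_DAY = 24 * 60 * 60 * 1000


def average_ms(total_ms: int, count: int) -> float:
    """Mean ingest time per object in milliseconds."""
    return total_ms / count


def as_days(total_ms: int) -> float:
    """Convert a duration in milliseconds to days."""
    return total_ms / MS_PER_DAY


avg = average_ms(TOTAL_MS, OBJECTS)
days = as_days(TOTAL_MS)
print(f"average: {avg:.1f} ms per object")
print(f"summed ingest time: {days:.2f} days")
```

The summed per-object time comes out somewhat below the reported total duration of approx. 3.5 days. The page does not say what accounts for the difference.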

The image below shows that the ingest time was fairly constant over time.

Full Import Of Patents Data #2

Setup

  • Standard Fedora installation
  • MPT Triplestore, remote host
  • Java 1.6
  • Managed content
  • Database: Postgres, remote host
  • Ingest from local machine, only one feeding instance
  • Usage of Java NIO for managed content retrieval

Statistics:

#            total (ms)     avg (ms)   min (ms)   max (ms)
1,409,833    169,965,104    120        41         14,202

Total duration of ingest: approx. 2.5 days

The image below shows that the ingest time of #2 was significantly lower than that of #1. This can largely be attributed to the following factors:

  • Remote database for repository and triplestore
  • Java NIO for retrieval of managed content (file copy operation on local machine)
  • Remote logging (no local I/O)

The overlay of ingests #1 and #2 shows a clear performance gain of almost 40%.
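
The gain can be reproduced from the two statistics rows. The following is a minimal sketch; `relative_gain` is a hypothetical helper name, and the inputs are the totals and averages reported for ingests #1 and #2. Because the two runs ingested slightly different numbers of objects (1,411,242 vs. 1,409,833), the per-object average is the fairer basis for comparison, so both are computed.

```python
def relative_gain(before_ms: float, after_ms: float) -> float:
    """Fraction of time saved when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


# Summed ingest times (ms) of ingest #1 and ingest #2.
total_gain = relative_gain(275_695_391, 169_965_104)

# Average per-object ingest times (ms) of ingest #1 and ingest #2.
avg_gain = relative_gain(195, 120)

print(f"gain on total time:   {total_gain:.1%}")
print(f"gain on average time: {avg_gain:.1%}")
```

Both values come out a little above 38%, in line with the "almost 40%" read off the overlay.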

Full Import Of Patents Data #3 (14 million digital objects)

Setup

  • Fedora 3.0b2, default installation
  • MPT Triplestore, remote
  • Java 1.6
  • Managed content datastreams
  • Database: Postgres, remote
  • Ingest from local machine, only one feeding instance
  • Total duration of ingest: 13 days

Statistics:

#             avg (ms)   min (ms)   max (ms)    std dev. (ms)
13,904,405    84.87      28.00      20456.67    62.98

Number of Triples: 742,186,874
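
A few derived figures help put these numbers in context. The sketch below uses only the values reported for ingest #3; the variable names are chosen for this example. It assumes the objects were ingested strictly one after another, which matches the single feeding instance listed in the setup, and derives the implied duration, the throughput, and the number of triples per object.

```python
# Figures copied from the statistics for ingest #3.
OBJECTS = 13_904_405       # ingested digital objects
AVG_MS = 84.87             # average ingest time per object (ms)
TRIPLES = 742_186_874      # triples in the triplestore after the ingest

MS_PER_DAY = 24 * 60 * 60 * 1000

# Duration implied by the average, assuming strictly sequential ingest.
estimated_days = OBJECTS * AVG_MS / MS_PER_DAY

# Sustained throughput of the single feeding instance.
objects_per_second = 1000 / AVG_MS

# Average number of triples written per object.
triples_per_object = TRIPLES / OBJECTS

print(f"estimated duration: {estimated_days:.1f} days")
print(f"throughput:         {objects_per_second:.1f} objects/s")
print(f"triples per object: {triples_per_object:.1f}")
```

The estimated duration of roughly 13.7 days agrees with the reported total of 13 days.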

Environment:

  • Java: 1.6.0_04, 64 bit
  • Tomcat: 5.5

Client JVM:

JAVA_OPTS="$JAVA_OPTS \
-Xms756m \
-Xmx756m \
-XX:+DisableExplicitGC"

Server JVM:

JAVA_OPTS="$JAVA_OPTS \
-Xms1g \
-Xmx1g \
-XX:+DisableExplicitGC \
-XX:NewSize=512m \
-XX:MaxNewSize=512m \
-XX:SurvivorRatio=10 \
-XX:TargetSurvivorRatio=90 \
-XX:MaxTenuringThreshold=30"
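
To make the server heap settings easier to read, the sketch below works out the approximate heap layout they imply. It relies on the HotSpot convention that SurvivorRatio is the ratio of eden to one survivor space and that the young generation consists of eden plus two survivor spaces. The function name is hypothetical, and the JVM aligns the actual sizes internally, so the results are approximations.

```python
def young_generation_layout(new_size_mb: float, survivor_ratio: int):
    """Split the young generation into eden and one survivor space (MB).

    young = eden + 2 * survivor, with eden = survivor_ratio * survivor.
    """
    survivor = new_size_mb / (survivor_ratio + 2)
    eden = survivor * survivor_ratio
    return eden, survivor


HEAP_MB = 1024             # -Xms1g / -Xmx1g
NEW_SIZE_MB = 512          # -XX:NewSize / -XX:MaxNewSize
SURVIVOR_RATIO = 10        # -XX:SurvivorRatio
TARGET_SURVIVOR_PCT = 90   # -XX:TargetSurvivorRatio

eden_mb, survivor_mb = young_generation_layout(NEW_SIZE_MB, SURVIVOR_RATIO)
old_gen_mb = HEAP_MB - NEW_SIZE_MB
# Desired survivor-space occupancy after a young collection.
survivor_target_mb = survivor_mb * TARGET_SURVIVOR_PCT / 100

print(f"eden:            {eden_mb:.1f} MB")
print(f"survivor (each): {survivor_mb:.1f} MB")
print(f"survivor target: {survivor_target_mb:.1f} MB")
print(f"old generation:  {old_gen_mb} MB")
```

With these settings half of the 1 GB heap is reserved for young objects, most of it as eden.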

Database:

  • remote host
  • WAL on separate partition
  • shared_buffers = 96MB
  • temp_buffers = 16MB
  • max_fsm_pages = 204800
  • fsync = on
  • synchronous_commit = off
  • full_page_writes = on
  • wal_writer_delay = 10000ms
  • effective_cache_size = 356MB
  • datestyle = 'iso, mdy'
  • lc_messages = 'en_US.UTF-8'
  • lc_monetary = 'en_US.UTF-8'
  • lc_numeric = 'en_US.UTF-8'
  • lc_time = 'en_US.UTF-8'

Fedora Server:

  • separate data partition: ext3 filesystem, mount options rw,data=ordered
  • Fedora partition: reiserfs

The two images below show the complete ingest process. The outcome proves that Fedora Commons can handle the ingest of 14 million digital objects. There are, however, several factors to consider:

  • The patents data are relatively homogeneous in size and composition. Other tests should be conducted using more diverse, heterogeneous data.
  • The underlying triplestore was MPT. Another test using Mulgara as triplestore should be done.
  • Several thousand objects were not ingested despite being well-formed and valid. The reason has yet to be determined.

This page (revision-2) was last changed on 21-Jul-2008 13:07 by KST