
Results of Scalability Improvements for Project Voldemort

For the past several weeks I’ve been working on and off on an NIO implementation of Project Voldemort‘s socket server. Voldemort is a distributed hash table used at LinkedIn to scale certain workloads. Until now, Voldemort has used a blocking I/O (BIO), thread-per-socket model for its binary protocol. My implementation instead supports tens of thousands of sockets on a single thread.

I wanted to share the results of some testing I’ve done comparing the existing BIO implementation of the socket server against my NIO implementation.

Test Environment

My testing environment is kept deliberately simple in order to encourage reproducibility. I’m running a single Voldemort instance on a server to which four client machines are pointing. I am simulating multiple clients by using multiple threads from each client.

In terms of the hardware involved, the server is an Intel Core 2 Duo 6700 (2.66 GHz) with 3.3 GB RAM[1]. The client machines are Intel Core 2 Duo 4300s (1.80 GHz) with a full 4 GB of RAM each. The server and clients are separated by Gigabit Ethernet on a dedicated switch.

The server and client machines are running Fedora 10 64-bit. The JVMs are standardized to version 1.6.0_14. The code is version 0.52 from my git-forked “nio-server” branch. Besides the changes made to my branch, I have made two small uncommitted changes, detailed in the patch below[2]. Basically, these uncommitted changes increase the heap size for both client and server to 3 GB and increase the client timeouts to 10 minutes(!).

Client Setup

To perform the tests I’m using the voldemort-remote-test.sh script on each of the four client machines, updated to use a 3 GB heap as stated above. I also updated voldemort.performance.RemoteTest to set some timeouts very long (10 minutes). I averaged the results from all four machines over 100 iterations of the test.

Server Setup

The server is set up to use either the BIO or NIO implementation depending on the value of “enable.nio.connector” in server.properties. The value of “storage.configs” was set to “voldemort.store.noop.NoopStorageConfiguration,voldemort.store.bdb.BdbStorageConfiguration” since some tests used the no-op storage engine. When running the BIO implementation, the value of “max.threads” was set as appropriate for the total number of client threads. For the NIO implementation, we run only two threads even when serving tens of thousands of clients.
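For reference, the relevant server.properties entries look like this (a sketch: the property names are those described above, while the max.threads value shown is illustrative for an 8,000-client BIO run):

```properties
# selects the socket server implementation: true = NIO, false = BIO
enable.nio.connector=true

# register the no-op engine alongside BDB so either can be used per test
storage.configs=voldemort.store.noop.NoopStorageConfiguration,voldemort.store.bdb.BdbStorageConfiguration

# BIO only: sized to match the total number of client threads
max.threads=8000
```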

After each test the Voldemort server was shut down and restarted.

For both the client and the server, I executed the command “ulimit -n 40000” prior to starting the JVM in order to provide room for the socket file descriptors.

Test Results

So far I’ve run three tests comparing the BIO implementation to the NIO implementation. These all center around measuring the writes/second as reported by the voldemort-remote-test.sh script.

Transactions/second using No-op Storage Engine

The first test measured transactions/second as the number of clients increased. The server runs the no-op storage engine to avoid skewing the results with BDB’s work (shown to be significant in the next test).

For this test, the invocation of voldemort-remote-test.sh provided these options:

Number of iterations: 100
Number of requests: 100,000
Operations: write and delete
Value size: 1024 bytes
Threads: as shown in the x-axis; each client ran ¼ of the threads

Here is a graph charting the results:

Transactions/second using No-op Storage Engine

While running the BIO implementation with 6,000 simulated clients, the client application started throwing lots of connection timeout errors on startup. After several minutes the clients would generally recover and complete the number of iterations. However, the reporting output from the client showed that the number of successful operations was often less than expected. For example, the delete operation would report “99,984 things deleted” instead of 100,000. This connection timeout behavior explains the increase in speed shown by the BIO implementation: as a fraction of the clients hung (and eventually timed out), the remaining clients could process their work that much faster, producing a more favorable measurement. These connection timeout errors were not seen with the NIO implementation until much higher client counts.

On the server, the load average (as measured by top) was about ½ the number of server threads when running the BIO implementation, e.g. around 4,000 for 8,000 clients. For NIO, it was always ~2. At 8,000 clients, the heap size grew to 2 GB for BIO while NIO used about 700 MB; the difference is largely attributable to per-thread overhead such as stack size.

Individual iteration measurements for clients varied wildly with the BIO implementation. Unfortunately I didn’t record this, but I noted several occasions where one iteration would yield 15,000 writes/sec and the next only 4,000. Deviation for the NIO implementation, however, was observed to be minimal.

The data for the BIO implementation ends abruptly before the 10,000-connection results because the clients hung until they were forcibly stopped (via kill -9). This was attributed to the BIO implementation’s inability to keep up because it was pegged in garbage collection.

At 20,000 connections the NIO implementation did cause one client machine to time out on connections, though it recovered. At 32,000 connections a noticeable drop in performance was observed, and at 34,000 the clients hung until forcibly stopped. This, too, was attributed to garbage collection times.

Transactions/second using BDB Storage Engine

The next test also measured the transactions/second as the number of clients increased. This time, however, the server is running the Berkeley Database storage engine to see how the socket server implementation is affected.

For this test, the invocation of voldemort-remote-test.sh provided these options:

Number of iterations: 100
Number of requests: 100,000
Operations: write and delete
Value size: 1024 bytes
Threads: as shown in the x-axis; each client ran ¼ of the threads

Here is a graph charting the results:

Transactions/second using BDB Storage Engine

As in the previous test, at around 6,000 clients the BIO implementation’s clients started throwing lots of connection timeout errors on startup. At 8,000 clients this connection timeout behavior occurred for all four clients, which explains the increase in speed shown by the BIO implementation. These connection timeout errors were not seen with the NIO implementation until around 30,000 clients.

The BDB storage engine greatly affects the transactions/second that can be processed. My guess is that, because the BDB code (and its threads) competes for CPU time against thousands of other threads in the BIO implementation, it becomes starved and makes little progress. The NIO implementation, with its fixed two threads, allows BDB to accomplish more work.

So even at a small number of clients, the NIO implementation achieves ~50-100% more transactions/second than the BIO implementation.

Transactions/second per Request Size

The next test also measured transactions/second, but this time as the request size increased. The server runs the no-op storage engine to more clearly measure the network I/O.

For this test, the invocation of voldemort-remote-test.sh provided these options:

Number of iterations: 100
Number of requests: 100,000
Operations: write and delete
Value size: as shown in the x-axis
Threads: 100 total, each client ran 25 threads

Here is a graph charting the results:

Transactions/second per Request Size

Request size has a clear effect on the number of transactions per second. Note that the request sizes on the x-axis double rather than increase linearly. As the graph shows, this appears to be a matter of JVM throughput rather than a nuance of the socket server implementation.

Conclusion

The implementation of the socket server delivers on NIO’s promise of better scalability, with asynchronous I/O and readiness selection winning out over the blocking I/O, thread-per-socket approach.

A Big Surprise

For a long time my NIO implementation suffered from weaker performance than the BIO implementation. I had scoured the code looking for problems but found nothing obvious. I had used the pattern I’d seen posted everywhere, wherein the server uses a single Selector to process readiness selection and, for each ready socket, hands processing off to a thread in a thread pool.

What I found while profiling the server was that ~30% of the runtime was spent updating the SelectionKey‘s interestOps. Understandably, updating a SelectionKey‘s interestOps requires acquiring a lock shared with the Selector implementation (at least in the 1.6 JDK on Linux). This caused a horrible serialization bottleneck.

What I did instead was move to a thread-per-Selector design wherein all of the I/O is done serially in a loop. There is literally one thread handling the I/O for tens of thousands of sockets. Because the interestOps update happens on the same thread that performs selection, there is no lock contention.
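To make the design concrete, here is a minimal, self-contained sketch of a thread-per-Selector loop in the same spirit: a single thread owns the Selector, performs all accepts, reads, and writes, and flips each key’s interestOps without ever contending with another thread. The class and method names are my own for illustration; this is not the Voldemort code itself.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

// One thread owns the Selector and performs ALL I/O for every registered
// socket, so interestOps updates never contend with a selecting thread.
public class SingleThreadEchoServer implements Runnable {

    private final Selector selector;
    private final ServerSocketChannel server;

    public SingleThreadEchoServer(int port) throws IOException {
        selector = Selector.open();
        server = ServerSocketChannel.open();
        server.configureBlocking(false);
        server.socket().bind(new InetSocketAddress(port));
        server.register(selector, SelectionKey.OP_ACCEPT);
    }

    public int port() {
        return server.socket().getLocalPort();
    }

    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                selector.select(100);
                Iterator<SelectionKey> it = selector.selectedKeys().iterator();
                while (it.hasNext()) {
                    SelectionKey key = it.next();
                    it.remove();
                    if (!key.isValid())
                        continue;
                    if (key.isAcceptable())
                        accept();
                    else if (key.isReadable())
                        read(key);
                    else if (key.isWritable())
                        write(key);
                }
            }
        } catch (IOException e) {
            // a sketch: real code would log and clean up per-socket state
        }
    }

    private void accept() throws IOException {
        SocketChannel channel = server.accept();
        if (channel == null)
            return;
        channel.configureBlocking(false);
        // attach a per-connection buffer; start by waiting for reads
        channel.register(selector, SelectionKey.OP_READ, ByteBuffer.allocate(1024));
    }

    private void read(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer buffer = (ByteBuffer) key.attachment();
        if (channel.read(buffer) < 0) {
            channel.close(); // peer disconnected; this cancels the key too
            return;
        }
        buffer.flip();
        // same thread as the select() loop, so no contention on the update
        key.interestOps(SelectionKey.OP_WRITE);
    }

    private void write(SelectionKey key) throws IOException {
        SocketChannel channel = (SocketChannel) key.channel();
        ByteBuffer buffer = (ByteBuffer) key.attachment();
        channel.write(buffer);
        if (!buffer.hasRemaining()) { // full echo sent; resume reading
            buffer.clear();
            key.interestOps(SelectionKey.OP_READ);
        }
    }

    // starts a server on an ephemeral port and echoes msg through it
    public static String echo(String msg) throws Exception {
        SingleThreadEchoServer srv = new SingleThreadEchoServer(0);
        Thread loop = new Thread(srv);
        loop.setDaemon(true);
        loop.start();
        try (Socket client = new Socket("localhost", srv.port())) {
            byte[] out = msg.getBytes("UTF-8");
            client.getOutputStream().write(out);
            byte[] in = new byte[out.length];
            for (int off = 0; off < in.length; )
                off += client.getInputStream().read(in, off, in.length - off);
            return new String(in, "UTF-8");
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(echo("ping"));
    }
}
```

The echo method demonstrates the loop end-to-end with a plain blocking client socket; every state transition on the server side (accept, read, flip to write, back to read) happens on the one selector thread.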

Even as I write this I’m dumbfounded that a single thread doing both the I/O and the processing of the operation could outperform a multi-threaded design. I’ve looked and looked and tested and tested, and it’s almost always faster. The code for the NIO implementation is there; take a look and let me know what I’ve overlooked.

Moving Forward

The next step is to get the NIO implementation accepted into the Voldemort master repository so that others can peruse the code and employ it in their own testing environments. I’d love to see it progress and improve based on others’ input. After that, should it stand up under scrutiny, it could become the default socket server implementation.

Well, I can dream 😉

1. The machine has 4 GB of DIMMs installed, but the BIOS stole 640 MB for some reason. Despite upgrading the BIOS (try doing that on a server running Linux without a floppy drive; not fun) and trudging through scores of forums, I could find no solution. Suffice it to say, I was/am very irritated. Is this an Asus thing?

2. Here’s an uncommitted patch that I used to increase the heap size and increase client timeouts:


diff --git a/bin/run-class.sh b/bin/run-class.sh
index ade56bb..a95d593 100755
--- a/bin/run-class.sh
+++ b/bin/run-class.sh
@@ -35,8 +35,8 @@ done
 CLASSPATH=$CLASSPATH:$base_dir/dist/resources

 if [ -z $VOLD_OPTS ]; then
-  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
+  VOLD_OPTS="-Xmx3G -server -Dcom.sun.management.jmxremote"
 fi

 export CLASSPATH
-java $VOLD_OPTS -cp $CLASSPATH $@
\ No newline at end of file
+java $VOLD_OPTS -cp $CLASSPATH $@
diff --git a/bin/voldemort-server.sh b/bin/voldemort-server.sh
index 9878095..66a69e5 100755
--- a/bin/voldemort-server.sh
+++ b/bin/voldemort-server.sh
@@ -42,7 +42,7 @@ done
 CLASSPATH=$CLASSPATH:$base_dir/dist/resources

 if [ -z $VOLD_OPTS ]; then
-  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
+  VOLD_OPTS="-Xmx3G -server -Dcom.sun.management.jmxremote"
 fi

 java $VOLD_OPTS -cp $CLASSPATH voldemort.server.VoldemortServer $@
diff --git a/test/integration/voldemort/performance/RemoteTest.java b/test/integration/voldemort/performance/RemoteTest.java
index 2a15955..29e945d 100644
--- a/test/integration/voldemort/performance/RemoteTest.java
+++ b/test/integration/voldemort/performance/RemoteTest.java
@@ -26,6 +26,7 @@ import java.util.List;
 import java.util.concurrent.CountDownLatch;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicInteger;

 import joptsimple.OptionParser;
@@ -132,6 +133,9 @@ public class RemoteTest {

         System.out.println("Bootstraping cluster data.");
         StoreClientFactory factory = new SocketStoreClientFactory(new ClientConfig().setMaxThreads(numThreads)
+                                                                                    .setConnectionTimeout(10, TimeUnit.MINUTES)
+                                                                                    .setRoutingTimeout(10, TimeUnit.MINUTES)
+                                                                                    .setSocketTimeout(10, TimeUnit.MINUTES)
                                                                                     .setMaxTotalConnections(numThreads)
                                                                                     .setMaxConnectionsPerNode(numThreads)
                                                                                     .setBootstrapUrls(url));