How To Try Out Hive on Your Local Machine — And Not Upset Your Ops Team

According to the Hive web site:

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files.

Hive is built on top of various technologies, the most notable being Hadoop and HDFS. As a result, to run Hive, you need access to a Hadoop cluster running a job tracker, task trackers, DFS nodes, and so on. You will also need an external database (MySQL, PostgreSQL, etc.) to store Hive’s meta data.

Even if you have such an environment available to you, for exploratory reasons it may be faster and less error prone to work in a sandbox. It’s very easy to seemingly delete files from HDFS using common Hive commands (loading data from HDFS will by default move the file to the Hive-managed location). While you’re finding your way around Hive, it might be best to isolate yourself from the outside world (metaphorically, mind you) to avoid causing any problems.

So then, how can you try out Hive on your local machine, and in a safe sandbox? Well, it’s actually pretty easy.

First, you’ll need to download and build Hive from the source.

Next, before running Hive, simply export the following environment variable:

export HIVE_OPTS="-hiveconf mapred.job.tracker=local \
   -hiveconf fs.default.name=file://`pwd`/tmp \
   -hiveconf hive.metastore.warehouse.dir=file://`pwd`/tmp/warehouse \
   -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true"

(Sorry about the WordPress formatting; watch out for the last line in the above…)

The HIVE_OPTS environment variable is used by the Hive command-line utility to provide overrides to the default Hive configuration. (Note: some references use the incorrect name HIVE_OPT (i.e. missing the S) which, of course, causes the values to be silently ignored.) Setting it prior to running the bin/hive script will cause the values set in the environment variable to be used instead.

Just for completeness, let’s review the settings:

  • mapred.job.tracker – This is a standard Hadoop configuration option to point to the URL of the job tracker. The magic value of “local” will cause Hadoop to be run on the local machine instead.
  • fs.default.name – This is another standard Hadoop configuration option to specify the root of the distributed file system used by Hadoop (often HDFS, but not necessarily). By using a file://-based URL, we use our local file system, which–for tests–is likely sufficient.
  • hive.metastore.warehouse.dir – This setting is specific to Hive, and is the directory name (relative to the fs.default.name) in which Hive’s warehouse data is stored. Again, we use a local file system-specific path.
  • javax.jdo.option.ConnectionURL – This is a standard JDBC URL used by Hive to connect to its meta data store. Using the value of jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true" allows us to use a local, embedded Derby database with its files stored on the local file system.

At this point, you should be ready to run Hive. Notice that as you create, load, and query from your database, the directory under the current directory (named “tmp”) is populated with a number of files. And guess what? If you want to start all over from scratch, simply exit Hive, delete that directory, and everything is new again.

Also of value in your travels is to set the logging level from the command line:

export HIVE_OPTS="-hiveconf hive.root.logger=DEBUG,console"

You can adjust the logging level to what you need. Another great benefit of running Hadoop locally is that you can get the debug logging to your console for both the Hive client and map/reduce execution.

Results of Scalability Improvements for Project Voldemort

For the past several weeks I’ve been working on and off on an NIO implementation of Project Voldemort‘s socket server. Voldemort is a distributed hash table used at LinkedIn in order to scale certain workloads. Up until now, Voldemort used a blocking I/O (BIO), thread-per-socket model for its binary protocol. My implementation moved to supporting thousands of sockets on a single thread.

I wanted to share the results of some testing I’ve done comparing the existing blocking I/O (BIO) implementation of the socket server vs. my NIO implementation.

Test Environment

My testing environment is kept deliberately simple in order to encourage reproducibility. I’m running a single Voldemort instance on a server to which four client machines are pointing. I am simulating multiple clients by using multiple threads from each client.

In terms of the hardware involved, the server is an Intel Core 2 Duo 6700 (2.66 GHz) with 3.3 GB RAM[1]. The client machines are Intel Core 2 Duo 4300s (1.80 GHz) with a full 4 GB of RAM each. The server and clients are separated by Gigabit Ethernet on a dedicated switch.

The server and client machines are running Fedora 10 64-bit. The JVMs are standardized to version 1.6.0_14. The code is version 0.52 from my git-forked “nio-server” branch. Besides the changes made to my branch, I have made two small uncommitted changes as detailed in the patch below[2]. Basically they uncommitted changes increase the heap size for both client and server to 3 GB and increase the client timeouts to 10 minutes(!).

Client Setup

To perform the tests I’m using the voldemort-remote-test.sh script on each of the four client machines, updated to use 3 GB heap as stated above. I updated the voldemort.performance.RemoteTest to set some timeouts very long (10 minutes). I simply averaged out the result from all four machines over 100 iterations of the test.

Server Setup

The server is set up to use either the BIO or NIO implementation depending on the value of “enable.nio.connector” in server.properties. The value of “storage.configs” was set to “voldemort.store.noop.NoopStorageConfiguration,voldemort.store.bdb.BdbStorageConfiguration” as some tests used the no-op storage engine. When running the BIO implementation the value of “max.threads” was set as appropriate based on the total number of client threads. For the NIO implementation we run only two threads regardless of the fact that we’re serving tens of thousands of clients.

After each test the Voldemort server was shut down and restarted.

For both the client and the server I executed the command “ulimit -n 40000” prior to starting the JVM to run the tests in order to provide room for the socket file descriptors.

Test Results

So far I’ve run three tests comparing the BIO implementation to the NIO implementation. These all center around measuring the writes/second as reported by the voldemort-remote-test.sh script.

Transactions/second using No-op Storage Engine

The first test measured the transactions/second as the number of clients increased. The server is running the no-op storage engine to avoid skewing the results with BDB’s work (as is shown to be significant in the next test).

For this test, the invocation of voldemort-remote-test.sh provided these options:

Number of iterations: 100
Number of requests: 100,000
Operations: write and delete
Value size: 1024 bytes
Threads: as shown in the x-axis; each client ran ¼ of the threads

Here is a graph charting the results:

Transactions/second using No-op Storage Engine

While running the BIO implementation simulating 6,000 clients, the client application started throwing lots of connection timeout errors upon starting up. After several minutes the clients would generally recover and complete the number of iterations. However, the reporting output from the client would show that the number of successful operations was often less than the expected. For example, the delete operation would report “99,984 things deleted” instead of 100,000. This connection timeout behavior explains the increase in speed shown by the BIO implementation. As a fraction of the clients hung (and eventually timed out), the remaining clients would process their work that much faster, producing a more favorable measurement. These connection timeout errors were not seen when running the NIO implementation until much later.

On the server, the load average (as measured by top) was about ½ the number of server threads, e.g. around 4,000 for 8,000 clients when running the BIO implementation. For NIO, it was always ~2. For 8,000 clients, the heap size went to 2 GB for BIO while NIO was about 700 MB. This is largely attributed to the per-thread stack size.

Individual iteration measurements for clients varied wildly with the BIO implementation. Unfortunately I didn’t measure this, but I did note several occasions where iterations would yield 15,000 writes/sec and the next iteration only 4,000 writes/sec. Deviation for the NIO implementation was observed to be minimal, however.

The data for the BIO implementation ends abruptly before recording the results for 10,000 connections since the clients hung until they were forcibly stopped (via kill -9). This was attributed to the inability of the BIO implementation to keep up due to the fact that it was pegged in garbage collection.

At 20,000 connections the NIO implementation did cause one client machine to timeout on connections, though it did recover. At 32,000 connections a noticeable drop in performance was noted and at 34,000 the clients hung until forcibly stopped. This too was attributed to garbage collection times.

Transactions/second using BDB Storage Engine

The next test also measured the transactions/second as the number of clients increased. This time, however, the server is running the Berkeley Database storage engine to see how the socket server implementation is affected.

For this test, the invocation of voldemort-remote-test.sh provided these options:

Number of iterations: 100
Number of requests: 100,000
Operations: write and delete
Value size: 1024 bytes
Threads: as shown in the x-axis; each client ran ¼ of the threads

Here is a graph charting the results:

Transactions/second using BDB Storage Engine

As with the above test, around 6,000 clients with the BIO implementation the clients started throwing lots of connection timeout errors when starting up. At 8,000 clients this connection timeout behavior occurred for all four clients and explains the increase in speed shown by the BIO implementation. These connection timeout errors were not seen when running the NIO implementation until around 30,000 clients.

The BDB storage engine greatly affects the transactions/second that can be processed. My guess is that because the BDB code (and its threads) are competing for CPU time against thousands of other threads in the BIO implementation that it begins to become starved in terms of making progress. The NIO implementation, however, using a fixed two threads allows BDB to accomplish more work.

So even at a small number of clients, we achieve an increase of ~50-100% more transactions/second with the NIO implementation over the BIO implementation.

Transactions/second per Request Size

The next test also measured the transactions/second, but this time as the request size increased. The server is running the no-op storage engine to measure more clearly the nework I/O.

For this test, the invocation of voldemort-remote-test.sh provided these options:

Number of iterations: 100
Number of requests: 100,000
Operations: write and delete
Value size: as shown in the x-axis
Threads: 100 total, each client ran 25 threads

Here is a graph charting the results:

Transactions/second per Request Size

It’s pretty easy to see how the request size has an obvious effect on the number of transactions. Please note that the request sizes in the x-axis are doubling rather than linearly increasing. As can be seen in the graph, this appears to be a matter of JVM throughput as opposed to the nuances of the socket server implementation.

Conclusion

The implementation of the socket server delivers on NIO’s promise of better scalability with regard to using asynchronous I/O and readiness selection over the blocking I/O and thread-per-socket approach.

A Big Surprise

For a long time my NIO implementation suffered from weaker performance than the BIO implementation. I had scoured the code looking for problems but found nothing obvious. I had used the examples that I’d seen posted everywhere wherein the server used a single Selector to process readiness selection and for each ready socket, handed processing of to a thread in a thread pool.

What I found during profiling of the server was that ~30% of the runtime was spent in updating the SelectionKey‘s interestOps. Understandably when a SelectionKey updates its interestOps it requires acquiring a shared lock in the Selector implementation (at least for the 1.6 JDK on Linux). This caused a horrible serialization problem.

What I did instead was to move to a thread-per-Selector design wherein all of the I/O is done serially in a loop. There is literally one thread handling the I/O for tens of thousands of sockets. Because it’s done in the same thread, when the SelectionKey‘s interestOps is updated, the lock is found to already be held, thus no contention.

Even while I’m writing this I’m dumbfounded how a single thread that includes I/O and processing the operation could outperform a multi-threaded design. I’ve looked and looked and tested and tested and it’s almost always faster. The code for the NIO implementation is there, take a look and let me know what I’ve overlooked.

Moving Forward

The next step is to get the NIO implementation accepted into the Voldemort master repository. This would help so that others can peruse the code and employ it in their testing environments. I’d love to see it progress and improve based on others’ input. After that–should it stand up under scrutiny–it could become the default socket server implementation.

Well, I can dream ;)

1. The machine has 4 GB DIMMs installed, but BIOS stole 640 MB for some reason. Despite upgrading the BIOS (try doing that on a server running Linux without a floppy drive — not fun) and trudging through scores of forums I could find no solution. Suffice to say, I was/am very irritated. Is this an Asus thing?

2. Here’s an uncommitted patch that I used to increase the heap size and increase client timeouts:


diff --git a/bin/run-class.sh b/bin/run-class.sh
index ade56bb..a95d593 100755
--- a/bin/run-class.sh
+++ b/bin/run-class.sh
@@ -35,8 +35,8 @@ done
 CLASSPATH=$CLASSPATH:$base_dir/dist/resources

 if [ -z $VOLD_OPTS ]; then
-  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
+  VOLD_OPTS="-Xmx3G -server -Dcom.sun.management.jmxremote"
 fi

 export CLASSPATH
-java $VOLD_OPTS -cp $CLASSPATH $@
\ No newline at end of file
+java $VOLD_OPTS -cp $CLASSPATH $@
diff --git a/bin/voldemort-server.sh b/bin/voldemort-server.sh
index 9878095..66a69e5 100755
--- a/bin/voldemort-server.sh
+++ b/bin/voldemort-server.sh
@@ -42,7 +42,7 @@ done
 CLASSPATH=$CLASSPATH:$base_dir/dist/resources

 if [ -z $VOLD_OPTS ]; then
-  VOLD_OPTS="-Xmx2G -server -Dcom.sun.management.jmxremote"
+  VOLD_OPTS="-Xmx3G -server -Dcom.sun.management.jmxremote"
 fi

 java $VOLD_OPTS -cp $CLASSPATH voldemort.server.VoldemortServer $@
diff --git a/test/integration/voldemort/performance/RemoteTest.java b/test/integration/voldemort/performance/RemoteTest.java
index 2a15955..29e945d 100644
--- a/test/integration/voldemort/performance/RemoteTest.java
+++ b/test/integration/voldemort/performance/RemoteTest.java
@@ -26,6 +26,7 @@ import java.util.List;
 import java.util.concurrent.CountDownLatch;
 import java.util.concurrent.ExecutorService;
 import java.util.concurrent.Executors;
+import java.util.concurrent.TimeUnit;
 import java.util.concurrent.atomic.AtomicInteger;

 import joptsimple.OptionParser;
@@ -132,6 +133,9 @@ public class RemoteTest {

         System.out.println("Bootstraping cluster data.");
         StoreClientFactory factory = new SocketStoreClientFactory(new ClientConfig().setMaxThreads(numThreads)
+                                                                                    .setConnectionTimeout(10, TimeUnit.MINUTES)
+                                                                                    .setRoutingTimeout(10, TimeUnit.MINUTES)
+                                                                                    .setSocketTimeout(10, TimeUnit.MINUTES)
                                                                                     .setMaxTotalConnections(numThreads)
                                                                                     .setMaxConnectionsPerNode(numThreads)
                                                                                     .setBootstrapUrls(url));

On the Trials of Using Websphere's JMS Provider

I was recently tasked with getting a stand-alone JMS-based client running using Websphere’s built-in JMS provider. I was confident that with just a bit of administration, a set of JAR files, and some jndi.properties values I could get something up and running in an hour or maybe two…

After about twelve hours I finally finished a prototype. The Java code was exceedingly simple and non-Websphere specific, but the Websphere-specific administration and settings were a nightmare. Here I’ve retraced my steps in case anyone else ever needs to do this again.

Pay very, very, very close attention to Websphere’s OS-specific requirements regarding system dependencies.

Hurdle 1: Installation

Getting the Software

The environment that I was using was a Fedora 8-based server and client. On the server we’re running Websphere 6.1 on the server as it matches what my client is running. On the server I’m using IBM’s JRE (of course) while on the client I wanted to specifically try to be JRE-agnostic, so I chose to use a late-model Sun 1.6 JRE.

The first piece of software to grab is Websphere itself. This was downloaded as was.cd.6100.trial.base.linux.ia32.tar.gz.

To satisfy Websphere’s requirements for Fedora simply run yum via root:

    # yum -y install compat-libstdc++-* compat-db

Lastly, you’ll need to get the so-called IBM Client for JMS on J2SE with IBM WebSphere Application Server installation package. This is downloaded as sibc_install-o0810.09.jar.

Installation

Installation is graphical by default. I tried to figure out how to install Websphere in some sort of console mode but all my efforts proved fruitless. (It is supposed to have a console and silent install mode, I just couldn’t get it to work.) So I begrudgingly fired up a VNC session to get to a UI.

First, extract was.cd.6100.trial.base.linux.ia32.tar.gz to a temporary directory and then run the script named launchpad.sh. Simply click to install the trial, using the default location of /opt/IBM/WebSphere/AppServer. Run the first steps console to perform “Installation verification” or to “Start the server.”

Note, if clicking the installer link in the UI appears to do nothing, please double-check that you’ve installed any OS-specific dependencies. This step took about two hours to realize why it wasn’t starting. Since Fedora isn’t an officially supported OS, I used the RHEL 4 directions.

At this point you can connect to Websphere’s “Integrated Solutions Console” (AKA the administration front-end) from another machine via the browser:


http://host:9060/ibm/console

Please substitute host with the host on which Websphere is installed.

Once you get the “Integrated Solutions Console” UI up, you’ll need to log in before you do anything else.

Hurdle 2: Configuring Websphere’s JMS Provider

Websphere’s administration UI is fairly clean, but knowing what you’re supposed to do to make it work is something else altogether. Note: there doesn’t seem to be any centralized documentation for Websphere. Every other question I had resulted in a 5-30 minute Google session to find the answer.

Configuring the Bus

The first thing to do is to configure Websphere’s “bus.” You have to do this before you set up any JMS-specific settings.

So – in the “Integrated Solutions Console” UI, expand the “Service Integration” node and then click “Buses”. Click “New” to create a new bus and name it simply “MyBus” and then click “Next” and “Finish”.

Note: You’ll be seeing the message “Changes have been made to your local configuration” a lot. Simply click the link to “Save” the configuration. We’ll restart the server from the command line once we’re all finished.

Next, go to the “MyBus” details screen. Click the link named “Bus members” and then “Add”. You’ll be asked to specify a “Server”, “Cluster”, or “WebSphere MQ server”. Select “Server” and click “Next”. Leave the “type of message store” as its default (“File store”) and click “Next”. The “message store properties” can be left as-is; click “Next” and then “Finish”. Don’t forget to “Save” the configuration as described above.

Next, go to the following screen: “Buses”, then “MyBus”, and then “Destinations”. Click “New” to create a new Destination. When prompted to “Select destination type”, choose “Topic space”, and click “Next”. Enter the “Identifier” “MyTopicSpace” and click “Next”. Leave the “Assign the queue to a bus member” select as-is and click “Next”, then “Finish”, and then save the configuration.

Configuring JMS Resources

In the “Integrated Solutions Console” UI, expand the “Resources” node in the UI, expand the “JMS” node, and then click “JMS providers”. First, select a “Scope” of “Node=Node01, Server=server1″ where is the host on which Websphere is installed. After the page has refreshed, click the link “Default messaging provider”. Click “Connection factories” and then “New”. We’re going to leave most of the options blank, but do enter the “Name” as “MyConnectionFactory” and the “JNDI name” as “jms/MyConnectionFactory”. For the “Bus name” select “MyBus” and then wait for the screen to refresh. Next enter the value “host:7276″ for the “Provider endpoints” option where host is the host on which Websphere is installed. Next, click “OK” and then save the configuration.

Next, expand the “Resources” node in the UI, expand the “JMS” node, and then click “JMS providers”. The “Scope” of “Node=Node01, Server=server1″ should still be selected. Click the link “Default messaging provider”, then click “Topics”, and then “New”. We’re also going to leave most of these options blank, but do enter the “Name” as “MyTopic” and the “JNDI name” as “jms/MyTopic”. For the “Bus name” select “MyBus” and then wait for the screen to refresh. Next select the “Topic space” named “MyTopicSpace” and wait for the screen to refresh. Next, click “OK” and then save the configuration.

Restarting Websphere

Don’t forget to restart! We need to restart so that our new configuration will become effective.

Simply drop into a terminal window and execute the following:

    # /opt/IBM/WebSphere/AppServer/bin/stopServer.sh server1 && sleep 5 && /opt/IBM/WebSphere/AppServer/bin/startServer.sh server1

Hurdle 3: Developing a Stand-alone Java Client

We’re finally ready to tackle wiring up our client to our new JMS server. Fortunately, this piece is pretty straightforward ;)

Installing the Client-side Libraries

It took me quite a bit of time to find the necessary client-side files to talk to Websphere’s JMS provider. These aren’t simply a set of JARs in the Websphere installation somewhere. They’re not up on any Maven repository or anything. Remember that scary sounding file that we downloaded before — sibc_install-o0810.09.jar? That file isn’t the JMS libraries but an installer for the libraries. So, on the client, run the installer thusly:

    $ java -jar /tmp/sibc_install-o0810.09.jar jms_jndi_sun -silent /tmp/jms

Note that the above assumes you’re running a Sun-based JRE on the client. If you’re running under IBM’s JRE you’ll need to use jms_jndi_ibm instead of jms_jndi_sun.

In the /tmp/jms/lib directory are the three JARs you’ll need to have in your CLASSPATH.

(Man, I’m so spoiled by Maven.)

Configuring JNDI Access to JMS

The last bit of the puzzle is the JNDI context configuration to connect to the JMS server. Although it took me about three hours to find these settings, they’re all that are needed. Simply put the following entries in jndi.properties (or Spring or wherever):

    java.naming.provider.url=iiop://stampy:2809
    java.naming.factory.initial=com.ibm.websphere.naming.WsnInitialContextFactory
    com.ibm.CORBA.ORBInit=com.ibm.ws.sib.client.ORB

That should be all you need.

Conclusion

My hat goes off to all Websphere administrators and developers. I have apparently been enjoying a very carefree existence in the lightweight/Spring/POJO world for the last five or so years.

If there’s one thing I’ve learned from this experience, it is the value of patience ;)

Setting up Tomcat Development in Eclipse

In this post I’ll outline how to get the Tomcat source code up and running in Eclipse. This is the opposite of what most people want to do: debug web applications using Tomcat in Eclipse. This is about downloading the source code, building, and debugging Tomcat itself.

We’ll be using Linux as the development environment, but any sensible environment should work with a few tweaks. I’m also using the latest version of the 1.6 JDK.

Getting the Source

The first thing you’ll need to do is to download the sources for Tomcat. There are a couple of ways to do this, but I’m going to use the up-to-date SVN repository for the trunk. So, open a shell and enter the following:

mkdir /tmp/tomcat
cd /tmp/tomcat/
svn co http://svn.apache.org/repos/asf/tomcat/trunk
cd trunk/
echo "base.path=/tmp/tomcat/trunk/downloads" > build.properties
ant download
ant
ant -f extras.xml

Creating the Eclipse Project

Now we turn our attention to setting up an Eclipse project that can build and execute Tomcat.

First, open Eclipse and start with the workspace of “/tmp/tomcat/trunk”. Create a new project using the “Java Project from Existing Ant Buildfile” option. Browse to the file “”/tmp/tomcat/trunk/build.xml” and use the “compile” target, then click “Finish”.

Immediately you’ll see hundreds of errors – don’t panic! We’ll fix these up by updating the classpath. View the project options for our new project and select to edit the “Java Build Path”. We’ll need to remove the “/tmp/tomcat/trunk/${ant.jar}” as this points to nothing (yet). Also remove the “JRE_LIB” entry as this is deprecated (so, why did Eclipse create it???).

To fix up the classpath, do the following:

  1. Click “Add External JARs” and add the JAR found at $ANT_HOME/lib/ant.jar
  2. Click “Add External JARs” and add the two JARs in /tmp/tomcat/trunk/output/extras/webservices
  3. Click “Add Library”, select “JRE System Library”, click “Next”, and then select the default (1.5/1.6) JDK and click “Finish”.

After closing the project properties, the workspace should build and — hopefully — there will be no more compilation/dependency errors.

Running Tomcat

Let’s now get Eclipse to run Tomcat from our source:

  1. Right click on the project, select “Run As…”
  2. Create a new “Java Application” configuration with a main class of “org.apache.catalina.startup.Bootstrap”
  3. Click on the “Arguments” tab and enter the “Program arguments” of “start”
  4. Select a “Working directory” of “Other” with a value of “${workspace_loc:Tomcat 6.0}/../output/build”
  5. Click “Run”

To verify that everything is working, point your browser to:

http://localhost:8080/examples/servlets

Hopefully this will help you set up your project in Eclipse to start hacking on the source code to Tomcat.

ByteBuffer.duplicate() Does Not Preserve Byte Order

I’ve just spent the last two hours pulling out my hair wondering why my ByteBuffer duplication code doesn’t work.

I’m having fun with a side project to create a lightweight CIFS server in Java. Since the CIFS protocol is in little endian byte order, I have to explicitly specify the byte order via the order API:

public static ByteBuffer allocate(int length) {
	ByteBuffer byteBuffer = ByteBuffer.allocate(length);
	byteBuffer.order(ByteOrder.LITTLE_ENDIAN);
	return byteBuffer;
}

Having a little utility method such as the above makes it easy to stop worrying about the byte ordering.

Many CIFS requests contain two sections – one section for request parameters and one section for request data. This seemed like a good place to use the duplicate API to split the one ByteBuffer into two “views”. (Note, this still uses the underlying buffer so you’re not making extra copies.)

However, my first pass at the code contained a subtle, but nasty bug. Suddenly values in the duplicated buffer were coming out with crazy values but the original buffer worked fine. And apparently I’m not the only one who’s been tripped up by it. A bug was filed over five years ago against this very API call, noting that the ordering is not preserved from the original to the duplicate. My favorite comment from the Java Bug Parade entry is:

No rightminded developer wants his duplicated ByteBuffer to behave differently than the original.

I would agree ;)

Fortunately this is easy to fix, and creating another utility method for it makes it even clearer:

public static ByteBuffer duplicate(ByteBuffer buffer) {
	ByteBuffer duplicateBuffer = (ByteBuffer)buffer.duplicate();
	duplicateBuffer.order(buffer.order());       // duplicate() does not preserve ordering!
	return duplicateBuffer;
}

This little method then preserves my ByteOrder.LITTLE_ENDIAN setting to avoid future problems. Plus, I can hide away the pre-Generics cast from Buffer to ByteBuffer in one place ;)

Introducing the Java Concurrency APIs

I recently had the privilege of helping one of my clients conduct technical interviews for a Senior Java Developer position. I enjoy interviewing candidates in part because it gives me a chance to learn something. While conducting the interviews I did learn a few things, so I’m happy :) However, one unfortunate thing I learned is that not many developers are familiar with the Java Concurrency APIs. Only one developer out of about 20 that we interviewed had used it, and only two or three had even read up on it. Now, I certainly don’t expect that everyone know every API out there. But in the case of this particular position, we needed someone who knew how to write multi-threaded applications correctly. There are many reasons to get up to speed with the APIs, especially if your applications operate in a multi-threaded environment as my client’s does.

So – I want to share some data that may motivate you to take a further look into the Concurrent APIs if you haven’t done so already. But first let’s take a 30,000 foot view of the APIs. (If you’re really impatient, skip down to the graphs, then come back here to find out what they mean.)

Overview of the Concurrency APIs

The java.util.concurrent package was introduced via JSR-166 and is present as a standard part of the Java libraries as of Java 5. However, for those unfortunate souls still on JDK versions 1.3 and 1.4 there are at least two mainstream back-port implementations of the concurrent package:

This refutes the notion that the Concurrency APIs are only for those whose product requires Java 5 or above.

The Concurrency APIs offer many different interfaces and classes to help with your multi-threaded application. So we’ll focus on one specific subset of the Concurrency APIs found in the java.util.concurrent.lock package: the Lock interface. An object of type Lock is an object representation of a lock in the JVM, similar to the synchronization primitive offered by the synchronized keyword. Rather than use a specific keyword, the Lock object represents the lock using class(es) and methods on which the lock is enabled, disabled, and so forth. The core Lock API contains a handful of methods, but the two most important are:

public interface Lock {

    public void lock();

    public void unlock();

}

These are the main methods used to perform — you guessed it! — locking and unlocking. OK, admittedly, the Lock interface doesn’t look convincing enough to run out and switch all your tried-and-true-and-tested code over to the Lock API. But let’s look at another interface in the java.util.concurrent.locks package: ReadWriteLock. Here is the interface in its entirety:

public interface ReadWriteLock {

    public Lock readLock();

    public Lock writeLock();

}

Wow – pretty short! But as you’ve probably guessed, the ReadWriteLock API is a step toward implementing the read-write locking pattern in your code. Let’s note the JavaDoc for the interface for what this pattern can achieve:

A ReadWriteLock maintains a pair of associated locks, one for read-only operations and one for writing. The read lock may be held simultaneously by multiple reader threads, so long as there are no writers. The write lock is exclusive.

So if the usage pattern of a particular piece of synchronized data is read-mostly, it’s possible to have more than one reader thread in the critical section. But when a write does occur, all other readers (and writers) will block until the current writer is finished. But unless a write is being performed, the code can achieve a higher degree of parallelism.

Usage of the Concurrency APIs

To see how straightforward it is to use the Lock as a replacement for the synchronized keyword, let’s use an example. My canonical example is, of course, a cache:

public interface Cache<K, V> {

    public void put(K key, V value);

    public V get(K key);

}

Let’s look at the pre-Concurrency API implementation that uses “classic” synchronization:

public class ClassicallySynchronizedCache<K, V> implements Cache<K, V> {

    private Map<K, V> cache = new HashMap<K, V>();

    public void put(K key, V value) {
        synchronized (cache) {
            cache.put(key, value);
        }
    }

    public V get(K key) {
        synchronized (cache) {
            return cache.get(key);
        }
    }

}

Pretty straightforward. Our cache allows for insertions and retrievals and uses synchronization to ensure that we don’t have any concurrency issues internal to our java.util.HashMap. Now let’s see how the code can be revised using the Lock API:

public class LockCache<K, V> implements Cache<K, V> {

    private Map<K, V> cache = new HashMap<K, V>();

    private Lock lock = new ReentrantLock();

    public void put(K key, V value) {
        try {
            lock.lock();
            cache.put(key, value);
        } finally {
            lock.unlock();
        }
    }

    public V get(K key) {
        try {
            lock.lock();
            return cache.get(key);
        } finally {
            lock.unlock();
        }
    }

}

As was mentioned, there are no longer blocks of code demarcated by the synchronized keyword. Instead, we have replaced that with a Lock object upon which we lock and unlock around our critical section.

Now let’s implement our Cache using the ReadWriteLock interface:

public class ReadWriteLockCache<K, V> implements Cache<K, V> {

    private Map<K, V> cache = new HashMap<K, V>();

    private ReadWriteLock lock = new ReentrantReadWriteLock();

    public void put(K key, V value) {
        try {
            lock.writeLock().lock();
            cache.put(key, value);
        } finally {
            lock.writeLock().unlock();
        }
    }

    public V get(K key) {
        try {
            lock.readLock().lock();
            return cache.get(key);
        } finally {
            lock.readLock().unlock();
        }
    }

}

The natural progression of the code is to separate the locks used to manage the reads/gets and writes/puts. These are both managed by the ReadWriteLock implementation to ensure that the locks coordinate and function as defined.

Results

But – as they say – the proof is in the pudding. Does all of this actually have any benefit for parallelization? Let’s take a comparative look of these three implementations running under different scenarios.

These scenarios usage different usage patterns to show how the different locking schemes affect parallelization. These results are generated using a benchmark tool I wrote to test out different locking schemes. The benchmark tests each of the three above styles of locking in a parameterized fashion so as to approximate a specific usage pattern. The test machine is a Core 2 Duo E6700 with 2 GB of RAM running Linux Fedora with Java 1.6.0_01. All listed times are expressed in milliseconds.

Scenario 1: Read-mostly, Fast Reads, Infrequent Writes

The first locking scenario is characterized by:

  • Reading threads outnumber writing threads 4:1
  • Read threads each perform 10,000,000 reads from the cache
  • Read lock duration short (reading shared data takes near-zero time)
  • Write lock duration medium (writing shared data occurs every second and takes 10 milliseconds)

Here we see that simply switching to a Lock-based locking scheme improves performance by 5x over the old-style synchronized approach even though the degree of parallelization is the same (only one reader or writer in the critical section concurrently). Interestingly, we note that while the ReadWriteLock does offer better parallelization (via more than one reader in the critical section at a time), the overhead of the internal locking algorithms is enough to degrade performance to where speed wins out over parallelization.

Scenario 2: Read-mostly, Slow Reads, Infrequent Writes

The second locking scenario is characterized by:

  • Reading threads outnumber writing threads 4:1
  • Read threads each perform 100 reads from the cache
  • Read lock duration long (reading shared data takes one millisecond)
  • Write lock duration medium (writing shared data occurs every second and takes 10 milliseconds)

Here we see that simply switching to a Lock-based locking scheme does not improve performance over the old-style synchronized approach. However, just look at how much more parallelization we achieve by taking advantage of the ReadWriteLock mechanism — performance increases by nearly 40x! We achieve better parallelization by allowing more than one reader in the critical section at a time. Since the reads are comparatively slow, the overhead of the internal locking algorithms does not adversely affect performance. On the other hand, the other two implementations begin blocking most severely.

Scenario 3: Read-mostly, Fast Reads, Frequent Writes

The third locking scenario is characterized by:

  • Reading threads outnumber writing threads 4:1
  • Read threads each perform 100,000,000 reads from the cache
  • Read lock duration short (reading shared data takes near-zero time)
  • Write lock duration medium (writing shared data occurs every 10 milliseconds and takes 10 milliseconds)

Again, simply switching to a Lock-based locking scheme improves performance by over 4x over the old-style synchronized approach with the same degree of parallelization. Interestingly, we again notice the overhead of the ReadWriteLock internal locking algorithm’s affect on performance, though still over 2x that of the synchronized approach.

Scenario 4: Equal Read and Write Threads, Fast Reads, Infrequent Writes

The last locking scenario is characterized by:

  • Reading threads and writing threads 1:1
  • Read threads each perform 10,000,000 reads from the cache
  • Read lock duration short (reading shared data takes near-zero time)
  • Write lock duration medium (writing shared data occurs every second and takes 10 milliseconds)

Yet again, simply switching to a Lock-based locking scheme improves performance by over 4x over the old-style synchronized approach with the same degree of parallelization. Interestingly, we again notice the overhead of the ReadWriteLock internal locking algorithm’s affect on performance, though the increased parallelization yields a nearly 3x performance improvement over that of the synchronized approach.

Conclusion

So what can we conclude from this? That we should immediately switch to using the Lock and/or ReadWriteLock APIs? No.

The point is:

No one locking mechanism will consistently perform better than another. Review your usage patterns, the locking mechanism they use, and test performance. Use the right tool for the job.

As usual, there are several factors involved with a given usage pattern:

  • Number of concurrent reader threads attempting to access the shared data structure
  • Time needed to read a value from the shared data structure
  • Number of concurrent writer threads attempting to modify the shared data structure
  • Time needed to write a value to the shared data structure
  • Time between write accesses

Of course, this list isn’t exhaustive, but it gives us an idea what factors come into play.

Remember too that the boxes that run our applications are being imbued with the ability to do more in parallel via multiple CPUs and N-core CPUs. A noticed trend is for CPU designers to rely more on the number of cores rather than raw performance of each core itself. That is, the number of cores may increase faster than the speed of each core. It behooves developers to ensure they’re using the appropriate amount of parallelization.

So – take a look at the Concurrency APIs if you haven’t already. There are many more gems in it that make it worth your while.

The Dangers of Escapism

I’m the kind of guy that has a ‘word of the day’ calendar and will use odd words during a conversation in an attempt to use the word sometime during the day. I especially enjoy when I learn a new word (or a new meaning of a word) that succinctly describes a concept or thought that I’ve been grasping to articulate. One such word that I recently came across describes a software design problem: “escapism.” Now I haven’t found a rigorous text book definition of the word, but I’ll do my best to define it:

Escapism: In software design, the inadvertent release of an object reference outside its current scope.

The usage of the qualifier “inadvertent” here is important because the programmer would probably find such behavior to be undesirable, though not always obvious.

One common source of escaped objects is the release of internal object references to external code. Programmers will work long and hard to encapsulate any internal objects a class uses in an effort to provide confidence in the object’s state, usage, etc. Anything that undermines that confidence should be addressed.

So, let’s look at a detailed example at where this form of escaping objects can cause problems.

Imagine a programmer creates a class—UserCache—to cache User objects. Part of the purpose of the class is that it will notify listeners when a User is added to or removed from the cache. To provide more flexibility, the programmer chooses to utilize the Decorator pattern to enhance an existing data structure with his cache logic rather than creating it from scratch.

Let’s see what the class looks like:

    public class UserCache extends AbstractSet<User> {

        private Set<User> rawSet;

        public UserCache(Set<User> rawSet) {
            this.rawSet = rawSet;
        }

        @Override
        public Iterator<User> iterator() {
            return rawSet.iterator();
        }

        @Override
        public boolean add(User user) {
            boolean wasAdded = rawSet.add(user);
            notifyListeners(user);
            return wasAdded;
        }

        @Override
        public boolean remove(Object o) {
            boolean wasRemoved = rawSet.remove(o);
            notifyListeners((User)o);
            return wasRemoved;
        }

    }

This is pretty standard use of the Decorator pattern. The programmer has decorated a standard java.util.Set named rawSet with some listener notification logic.

In general, whenever a class decorates another object it’s important (usually imperative) that the decoratee not be used for other purposes. For instance, adding or removing objects to the decoratee directly would be problematic:

    Set<User> internalSet = new HashSet<User>(64);
    Set<User> users       = new UserCache(internalSet);
    . . .
    internalSet.add(someUser);
    internalSet.add(anotherUser);

It is clearly inappropriate to use internalSet in this manner after it has been decorated as we would not call our listeners about these additions. As the creator of the UserCache class, we might not have any idea what sort of code depends on the listener notifications to work correctly. The result of missed notifications could result in any number of really bad problems. So it’s important we make the UserCache class bulletproof in terms of delivering event notifications.

Worried by this possibility the programmer rewrites the UserCache to use an internal java.util.Set rather than decorating a user-provided Set:

    public class UserCache extends AbstractSet<User> {

        private Set<User> rawSet;

        public UserCache() {
            rawSet = new HashSet<User>();
        }

        . . .

    }

So while it isn’t as flexible, at least now there’s no way that a user can get access to the internal java.util.Set and add or remove objects directly, right? As we look over the class we don’t see anywhere that rawSet is returned or passed as a parameter or anything. It would appear that the programmer has solved the problem. While there isn’t anywhere that we return rawSet directly, is there a way that we could be returning another object that will allow access to rawSet?

Take a closer look at the iterator method:

    @Override
    public Iterator<User> iterator() {
        return rawSet.iterator();
    }

Notice that it returns a java.util.Iterator based on rawSet. ‘But,’ you might say, ‘I want to be able to iterate over all the users.’ That’s fine. But recall that the java.util.Iterator interface includes a method named remove that removes the object last returned from next. What happens when we call remove is that the object is removed from rawSet directly – not from UserCache. In this case, UserCache would not receive any notification of the removal and thus it has no possible way to notify the listeners. While we would agree that most callers of iterator do simply want to iterate over the users, there’s no way to guarantee that some caller, some time, may try to remove a user directly via that interface.

In this example we would say that rawSet has escaped via the iterator method. This is a slight variation of escapism because strictly speaking we aren’t returning a reference to rawSet to external code, but we are returning a view of rawSet through which we can modify it. Unfortunately this is a fairly simple example. Take a look at the java.util.List and java.util.Map interfaces and see how many places a view of the internal object is returned. And note too the java.util.ListIterator that extends Iterator and includes an add and a set method :(

At this point it should be emphasized that it isn’t just java.util.Collection-related classes that need to be worried about escapism. The following code snippet shows how an object reference can escape through the constructor:

    public class FileRepository implements FileEventListener {

        private File directory;

        public FileRepository(File directory) {
            FileEventListenerManager.register(this);
            this.directory = directory;
        }

        public void fileAccessed(FileEvent fileEvent) {
            File updatedFile = fileEvent.getFile();

            if (directory.equals(updatedFile.getParentFile()))
                touch(updatedFile);
        }

        . . .

    }

In this second example, we have a class—FileRepository—that listens for accesses to files and touches them if they’re within a certain directory. Notice that constructor lets the this reference escape through the call to FileEventListenerManager.register(). While this may seem innocuous (and arguably intentional), note that the internal directory object has not been initialized. In a multithreaded environment it is perfectly possible that immediately after the listener registration occurs that a file is accessed, thus calling the FileRepository‘s fileAccessed method. If this occurs, a java.lang.NullPointerException will be thrown when the equals method is invoked on a then-null object.

As you can begin to imagine, there are many scenarios in which escapism can cause problems. So, how do we solve the problem of escapism?

In our UserCache example, we have to implement iterator as it’s part of the java.util.Set interface. A first shot might be to provide an implementation of java.util.Iterator that calls back to rawSet on calls to Iterator.remove. While this may seem like a workable solution on the surface, when Iterator.remove is called and then rawSet.remove is called, the caller will receive a java.util.ConcurrentModificationException because he’s both iterating and modifying rawSet “concurrently.”

Another possibility would be to return a java.util.Iterator from a read-only view of rawSet via:

    @Override
    public Iterator<User> iterator() {
        return Collections.unmodifiableSet(rawSet).iterator();
    }

The caller can still call Iterator.remove, but he will receive an java.lang.UnsupportedOperationException. While it’s arguable that teasing the caller with a method that will only fail in this way is bad design, at least it’s consistently so with other parts of the JDK.

In the example of the FileRepository it may seem like the easist thing to do is switch the code in the constructor to read:

    public class FileRepository implements FileEventListener {

        private File directory;

        public FileRepository(File directory) {
            this.directory = directory;
            FileEventListenerManager.register(this);
        }

        public void fileAccessed(FileEvent fileEvent) {
            File updatedFile = fileEvent.getFile();

            if (directory.equals(updatedFile.getParentFile()))
                touch(updatedFile);
        }

        . . .

    }

Initializing the directory object first will make a big difference. However, in other cases it might not be so cut and dry, especially in concurrent applications. The book Java Concurrency in Practice succinctly and bluntly advises:

Do not allow the this reference to escape during construction.

Some suggestions for this case are to create a static factory method such that the object can be fully constructed (and thus ensuring that its object reference is valid) before being handed off to any external code:

    public class FileRepository implements FileEventListener {

        private File directory;

        private FileRepository(File directory) {
            this.directory = directory;
        }

        public void fileAccessed(FileEvent fileEvent) {
            File updatedFile = fileEvent.getFile();

            if (directory.equals(updatedFile.getParentFile()))
                touch(updatedFile);
        }

        public static FileRepository create(File directory) {
            FileRepository fr = new FileRepository(directory);
            FileEventListenerManager.register(fr);
            return fr;
        }

        . . .

    }

Finding possible escape paths for objects takes serious attention to detail, knowing a little about the internals of the objects in use, and, of course, testing. But finding the problem is only half the battle, patching the escape paths is equally challenging. Unfortunately there is no across-the-board way to fix the problems, it takes hunkering down and thinking through the seeming myriad of possibilities.

Happy hunting! ;)