How To Try Out Hive on Your Local Machine — And Not Upset Your Ops Team

According to the Hive web site:

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files.

Hive is built on top of various technologies, the most notable being Hadoop and HDFS. As a result, to run Hive, you need access to a Hadoop cluster running a job tracker, task trackers, DFS nodes, and so on. You will also need an external database (MySQL, PostgreSQL, etc.) to store Hive’s metadata.

Even if you have such an environment available to you, it may be faster and less error-prone to explore in a sandbox. It’s very easy to seemingly delete files from HDFS using common Hive commands: loading data from HDFS will by default move the file into the Hive-managed location, so it disappears from its original path. While you’re finding your way around Hive, it might be best to isolate yourself from the outside world (metaphorically, mind you) to avoid causing any problems.

So then, how can you try out Hive on your local machine, and in a safe sandbox? Well, it’s actually pretty easy.

First, you’ll need to download and build Hive from the source.

Next, before running Hive, simply export the following environment variable:

export HIVE_OPTS="-hiveconf mapred.job.tracker=local \
   -hiveconf`pwd`/tmp \
   -hiveconf hive.metastore.warehouse.dir=file://`pwd`/tmp/warehouse \
   -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true"

(Sorry about the WordPress formatting; watch out for the last line in the above…)

The HIVE_OPTS environment variable is used by the Hive command-line utility to provide overrides to the default Hive configuration. (Note: some references use the incorrect name HIVE_OPT, i.e. missing the S, which, of course, causes the values to be silently ignored.) Set it before running the bin/hive script, and the values it carries are used in place of the defaults.
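Because a misspelled variable name fails silently, a quick pre-flight check before launching bin/hive can save some head-scratching. The following is only a minimal sketch: the function name check_hive_opts is made up for illustration and is not part of Hive.

```shell
#!/bin/sh
# Hypothetical helper (not part of Hive): warn when the misspelled
# HIVE_OPT is set, since only HIVE_OPTS is honored.
check_hive_opts() {
    if [ -n "${HIVE_OPT:-}" ]; then
        echo "warning: HIVE_OPT is set but will be ignored; did you mean HIVE_OPTS?" >&2
        return 1
    fi
    if [ -z "${HIVE_OPTS:-}" ]; then
        echo "note: HIVE_OPTS is empty; Hive will use its default configuration" >&2
    fi
    return 0
}
```

Calling check_hive_opts right before bin/hive returns a non-zero status when the misspelled variable is present, so it can gate the launch in a wrapper script.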

Just for completeness, let’s review the settings:

  • mapred.job.tracker – This is a standard Hadoop configuration option that gives the host and port of the job tracker. The special value “local” makes Hadoop run jobs in-process on the local machine instead.
  • – This is another standard Hadoop configuration option that specifies the root of the distributed file system used by Hadoop (often HDFS, but not necessarily). By using a file://-based URL, we use our local file system, which is likely sufficient for tests.
  • hive.metastore.warehouse.dir – This setting is specific to Hive, and is the directory (relative to the file system root configured above) in which Hive’s warehouse data is stored. Again, we use a path on the local file system.
  • javax.jdo.option.ConnectionURL – This is a standard JDBC URL used by Hive to connect to its metadata store. Using the value jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true allows us to use a local, embedded Derby database with its files stored on the local file system.
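The settings reviewed above can also be assembled by a small script, which makes the sandbox location explicit instead of relying on repeated `pwd` calls. Treat this as a sketch under assumptions: the function name build_hive_opts is hypothetical, and the file system property is assumed to be fs.default.name (the Hadoop option of that era for the default file system), since its name does not appear in the text above.

```shell
#!/bin/sh
# Hypothetical helper: print a HIVE_OPTS value that keeps all Hive state
# under a single sandbox directory (default: ./tmp).
build_hive_opts() {
    sandbox="${1:-$(pwd)/tmp}"
    # ASSUMPTION: fs.default.name is the file system root property;
    # the original text does not name it.
    # printf reuses the '%s' format for each argument, so the pieces
    # are concatenated into one line.
    printf '%s' \
        "-hiveconf mapred.job.tracker=local" \
        " -hiveconf fs.default.name=file://${sandbox}" \
        " -hiveconf hive.metastore.warehouse.dir=file://${sandbox}/warehouse" \
        " -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=${sandbox}/metastore_db;create=true"
}

HIVE_OPTS="$(build_hive_opts)"
export HIVE_OPTS
```

Passing a directory as the first argument, for example build_hive_opts /tmp/hive-sandbox, places the sandbox somewhere other than the current directory.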

At this point, you should be ready to run Hive. Notice that as you create, load, and query from your database, the directory under the current directory (named “tmp”) is populated with a number of files. And guess what? If you want to start all over from scratch, simply exit Hive, delete that directory, and everything is new again.
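Starting over can be wrapped in a small function with a safety check, so that a mistyped path does not take unrelated files with it. This is a sketch only: reset_hive_sandbox is a hypothetical name, and the check assumes the sandbox contains the metastore_db or warehouse directory used in the settings above.

```shell
#!/bin/sh
# Hypothetical helper: wipe the sandbox directory so the next Hive run
# starts with an empty warehouse and metastore. Exit Hive first.
reset_hive_sandbox() {
    sandbox="${1:-$(pwd)/tmp}"
    # Refuse to delete anything that does not look like a Hive sandbox.
    if [ ! -d "${sandbox}/metastore_db" ] && [ ! -d "${sandbox}/warehouse" ]; then
        echo "refusing to delete ${sandbox}: no metastore_db or warehouse inside" >&2
        return 1
    fi
    rm -rf -- "${sandbox}"
}
```

Run without arguments, it targets the tmp directory under the current directory, matching the layout described above.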

It also helps to set the logging level from the command line. Append to HIVE_OPTS rather than overwriting it, so that the sandbox settings exported earlier are kept:

export HIVE_OPTS="$HIVE_OPTS -hiveconf hive.root.logger=DEBUG,console"

You can adjust the logging level to what you need. Another great benefit of running Hadoop locally is that you can get the debug logging to your console for both the Hive client and map/reduce execution.