Give Your MySQL Account Access to Hive

One area of the Apache Hive documentation that’s not entirely explicit is in regard to the database privileges needed for its metastore[1]. Developers often become accustomed to creating a database account that has all privileges granted. But in the Real World, end users of Hive must configure it to point to a metastore RDBMS account with very specific privileges granted by the DBA.

Are You Privileged?

By default, Hive is configured to use a (local-to-that-system) Apache Derby database instance as its metastore. However, once an initial evaluation phase of Hive is complete, the team will likely need to move to a shared database to use as their metastore so the whole team is in sync. MySQL is a suitable database to be used as the metastore, though Hive supports other relational databases as well.

So which database roles must be granted to the account used for the Hive metastore in MySQL?

  • CREATE
  • DROP
  • ALTER
  • INDEX
  • SELECT
  • INSERT
  • UPDATE
  • DELETE

You can refer to the official MySQL documentation for granting privileges regarding the specifics of the roles and the exact syntax.

Here’s an example MySQL script that creates a dedicated user (“hive”), a MySQL database (“hivemetastore”), and grants the user some DDL[2] and DML privileges:


CREATE DATABASE hivemetastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'mypassword';
GRANT CREATE, DROP, ALTER, INDEX ON hivemetastore.* TO 'hive'@'%';
GRANT SELECT, INSERT, UPDATE, DELETE ON hivemetastore.* TO 'hive'@'%';

Once the DBA has created the account and granted it the requisite privileges, the user running Hive can specify this MySQL database as his Hive metastore. This is done either by exporting the following settings via HIVE_OPTS or by specifying them on the command line when invoking the CLI:


javax.jdo.option.ConnectionURL=jdbc:mysql://your.mysql.server/hivemetastore
javax.jdo.option.ConnectionUserName=hive
javax.jdo.option.ConnectionPassword=mypassword

[1] It can be confusing when discussing Hive someone refers to the “Hive” database, tables, or columns. To which are they referring? To the database, tables, and columns that Hive presents (with data stored in HDFS, S3, etc.) or the metastore database, tables, and columns (with data stored in MySQL, Postgres, etc.) in which Hive stores its meta data? Unfortunately the two are often used interchangeably.

[2] The reason that the CREATE, DROP, ALTER, and INDEX privileges are needed is because Hive can dynamically create the needed schema in the metastore database in which to store information about the Hive database.

How To Try Out Hive on Your Local Machine — And Not Upset Your Ops Team

According to the Hive web site:

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files.

Hive is built on top of various technologies, the most notable being Hadoop and HDFS. As a result, to run Hive, you need access to a Hadoop cluster running a job tracker, task trackers, DFS nodes, and so on. You will also need an external database (MySQL, PostgreSQL, etc.) to store Hive’s meta data.

Even if you have such an environment available to you, for exploratory reasons it may be faster and less error prone to work in a sandbox. It’s very easy to seemingly delete files from HDFS using common Hive commands (loading data from HDFS will by default move the file to the Hive-managed location). While you’re finding your way around Hive, it might be best to isolate yourself from the outside world (metaphorically, mind you) to avoid causing any problems.

So then, how can you try out Hive on your local machine, and in a safe sandbox? Well, it’s actually pretty easy.

First, you’ll need to download and build Hive from the source.

Next, before running Hive, simply export the following environment variable:

export HIVE_OPTS="-hiveconf mapred.job.tracker=local \
   -hiveconf fs.default.name=file://`pwd`/tmp \
   -hiveconf hive.metastore.warehouse.dir=file://`pwd`/tmp/warehouse \
   -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true"

(Sorry about the WordPress formatting; watch out for the last line in the above…)

The HIVE_OPTS environment variable is used by the Hive command-line utility to provide overrides to the default Hive configuration. (Note: some references use the incorrect name HIVE_OPT (i.e. missing the S) which, of course, causes the values to be silently ignored.) Setting it prior to running the bin/hive script will cause the values set in the environment variable to be used instead.

Just for completeness, let’s review the settings:

  • mapred.job.tracker – This is a standard Hadoop configuration option to point to the URL of the job tracker. The magic value of “local” will cause Hadoop to be run on the local machine instead.
  • fs.default.name – This is another standard Hadoop configuration option to specify the root of the distributed file system used by Hadoop (often HDFS, but not necessarily). By using a file://-based URL, we use our local file system, which–for tests–is likely sufficient.
  • hive.metastore.warehouse.dir – This setting is specific to Hive, and is the directory name (relative to the fs.default.name) in which Hive’s warehouse data is stored. Again, we use a local file system-specific path.
  • javax.jdo.option.ConnectionURL – This is a standard JDBC URL used by Hive to connect to its meta data store. Using the value of jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true" allows us to use a local, embedded Derby database with its files stored on the local file system.

At this point, you should be ready to run Hive. Notice that as you create, load, and query from your database, the directory under the current directory (named “tmp”) is populated with a number of files. And guess what? If you want to start all over from scratch, simply exit Hive, delete that directory, and everything is new again.

Also of value in your travels is to set the logging level from the command line:

export HIVE_OPTS="-hiveconf hive.root.logger=DEBUG,console"

You can adjust the logging level to what you need. Another great benefit of running Hadoop locally is that you can get the debug logging to your console for both the Hive client and map/reduce execution.

On the Trials of Using Websphere's JMS Provider

I was recently tasked with getting a stand-alone JMS-based client running using Websphere’s built-in JMS provider. I was confident that with just a bit of administration, a set of JAR files, and some jndi.properties values I could get something up and running in an hour or maybe two…

After about twelve hours I finally finished a prototype. The Java code was exceedingly simple and non-Websphere specific, but the Websphere-specific administration and settings were a nightmare. Here I’ve retraced my steps in case anyone else ever needs to do this again.

Pay very, very, very close attention to Websphere’s OS-specific requirements regarding system dependencies.

Hurdle 1: Installation

Getting the Software

The environment that I was using was a Fedora 8-based server and client. On the server we’re running Websphere 6.1 on the server as it matches what my client is running. On the server I’m using IBM’s JRE (of course) while on the client I wanted to specifically try to be JRE-agnostic, so I chose to use a late-model Sun 1.6 JRE.

The first piece of software to grab is Websphere itself. This was downloaded as was.cd.6100.trial.base.linux.ia32.tar.gz.

To satisfy Websphere’s requirements for Fedora simply run yum via root:

    # yum -y install compat-libstdc++-* compat-db

Lastly, you’ll need to get the so-called IBM Client for JMS on J2SE with IBM WebSphere Application Server installation package. This is downloaded as sibc_install-o0810.09.jar.

Installation

Installation is graphical by default. I tried to figure out how to install Websphere in some sort of console mode but all my efforts proved fruitless. (It is supposed to have a console and silent install mode, I just couldn’t get it to work.) So I begrudgingly fired up a VNC session to get to a UI.

First, extract was.cd.6100.trial.base.linux.ia32.tar.gz to a temporary directory and then run the script named launchpad.sh. Simply click to install the trial, using the default location of /opt/IBM/WebSphere/AppServer. Run the first steps console to perform “Installation verification” or to “Start the server.”

Note, if clicking the installer link in the UI appears to do nothing, please double-check that you’ve installed any OS-specific dependencies. This step took about two hours to realize why it wasn’t starting. Since Fedora isn’t an officially supported OS, I used the RHEL 4 directions.

At this point you can connect to Websphere’s “Integrated Solutions Console” (AKA the administration front-end) from another machine via the browser:


http://host:9060/ibm/console

Please substitute host with the host on which Websphere is installed.

Once you get the “Integrated Solutions Console” UI up, you’ll need to log in before you do anything else.

Hurdle 2: Configuring Websphere’s JMS Provider

Websphere’s administration UI is fairly clean, but knowing what you’re supposed to do to make it work is something else altogether. Note: there doesn’t seem to be any centralized documentation for Websphere. Every other question I had resulted in a 5-30 minute Google session to find the answer.

Configuring the Bus

The first thing to do is to configure Websphere’s “bus.” You have to do this before you set up any JMS-specific settings.

So – in the “Integrated Solutions Console” UI, expand the “Service Integration” node and then click “Buses”. Click “New” to create a new bus and name it simply “MyBus” and then click “Next” and “Finish”.

Note: You’ll be seeing the message “Changes have been made to your local configuration” a lot. Simply click the link to “Save” the configuration. We’ll restart the server from the command line once we’re all finished.

Next, go to the “MyBus” details screen. Click the link named “Bus members” and then “Add”. You’ll be asked to specify a “Server”, “Cluster”, or “WebSphere MQ server”. Select “Server” and click “Next”. Leave the “type of message store” as its default (“File store”) and click “Next”. The “message store properties” can be left as-is; click “Next” and then “Finish”. Don’t forget to “Save” the configuration as described above.

Next, go to the following screen: “Buses”, then “MyBus”, and then “Destinations”. Click “New” to create a new Destination. When prompted to “Select destination type”, choose “Topic space”, and click “Next”. Enter the “Identifier” “MyTopicSpace” and click “Next”. Leave the “Assign the queue to a bus member” select as-is and click “Next”, then “Finish”, and then save the configuration.

Configuring JMS Resources

In the “Integrated Solutions Console” UI, expand the “Resources” node in the UI, expand the “JMS” node, and then click “JMS providers”. First, select a “Scope” of “Node=Node01, Server=server1″ where is the host on which Websphere is installed. After the page has refreshed, click the link “Default messaging provider”. Click “Connection factories” and then “New”. We’re going to leave most of the options blank, but do enter the “Name” as “MyConnectionFactory” and the “JNDI name” as “jms/MyConnectionFactory”. For the “Bus name” select “MyBus” and then wait for the screen to refresh. Next enter the value “host:7276″ for the “Provider endpoints” option where host is the host on which Websphere is installed. Next, click “OK” and then save the configuration.

Next, expand the “Resources” node in the UI, expand the “JMS” node, and then click “JMS providers”. The “Scope” of “Node=Node01, Server=server1″ should still be selected. Click the link “Default messaging provider”, then click “Topics”, and then “New”. We’re also going to leave most of these options blank, but do enter the “Name” as “MyTopic” and the “JNDI name” as “jms/MyTopic”. For the “Bus name” select “MyBus” and then wait for the screen to refresh. Next select the “Topic space” named “MyTopicSpace” and wait for the screen to refresh. Next, click “OK” and then save the configuration.

Restarting Websphere

Don’t forget to restart! We need to restart so that our new configuration will become effective.

Simply drop into a terminal window and execute the following:

    # /opt/IBM/WebSphere/AppServer/bin/stopServer.sh server1 && sleep 5 && /opt/IBM/WebSphere/AppServer/bin/startServer.sh server1

Hurdle 3: Developing a Stand-alone Java Client

We’re finally ready to tackle wiring up our client to our new JMS server. Fortunately, this piece is pretty straightforward ;)

Installing the Client-side Libraries

It took me quite a bit of time to find the necessary client-side files to talk to Websphere’s JMS provider. These aren’t simply a set of JARs in the Websphere installation somewhere. They’re not up on any Maven repository or anything. Remember that scary sounding file that we downloaded before — sibc_install-o0810.09.jar? That file isn’t the JMS libraries but an installer for the libraries. So, on the client, run the installer thusly:

    $ java -jar /tmp/sibc_install-o0810.09.jar jms_jndi_sun -silent /tmp/jms

Note that the above assumes you’re running a Sun-based JRE on the client. If you’re running under IBM’s JRE you’ll need to use jms_jndi_ibm instead of jms_jndi_sun.

In the /tmp/jms/lib directory are the three JARs you’ll need to have in your CLASSPATH.

(Man, I’m so spoiled by Maven.)

Configuring JNDI Access to JMS

The last bit of the puzzle is the JNDI context configuration to connect to the JMS server. Although it took me about three hours to find these settings, they’re all that are needed. Simply put the following entries in jndi.properties (or Spring or wherever):

    java.naming.provider.url=iiop://stampy:2809
    java.naming.factory.initial=com.ibm.websphere.naming.WsnInitialContextFactory
    com.ibm.CORBA.ORBInit=com.ibm.ws.sib.client.ORB

That should be all you need.

Conclusion

My hat goes off to all Websphere administrators and developers. I have apparently been enjoying a very carefree existence in the lightweight/Spring/POJO world for the last five or so years.

If there’s one thing I’ve learned from this experience, it is the value of patience ;)

Simple Load Balancing using Apache mod_proxy_balancer in Just Three Steps

Apache’s mod_proxy_balancer is a very easy to set up load balancer.

You could say that I was born with a hardware-based load balancer, as I have never personally been involved in the purchasing, evaluation, administration, and so forth of production load balancers. This was left up to the ops team, not the engineers. That said I found setting up a rudimentary load balancing development environment with mod_proxy_balancer to be trivial. My use was limited to being able to verify that my client’s web application was designed correctly to scale past a single server instance in as short a time as possible. For that goal, it met my needs.

To set up a trivial load balancing environment, you really only need three machines – one web server (via which we’ll load balance) and two or more backing web application servers. And there are only three steps to do it…

First, double check that mod_proxy_balancer is provided with your Apache web server and is installed. I found that it was in my Fedora-based installation, though I’ve seen reports of other configurations not having this by default. There are a couple of ways to verify the module is installed, but one way is:

# apachectl -t -D DUMP_MODULES 2>&1 | grep proxy_balancer_module

And another is to look for “LoadModule proxy_balancer_module modules/mod_proxy_balancer.so” in your httpd.conf file.

Second, specify the web application servers over which to load balance by opening your httpd.conf file and adding these lines:

ProxyPass / http://app.example.com/

<Proxy http://app.example.com/&gt;
BalancerMember http://app-01.example.com:8080
BalancerMember http://app-02.example.com:8080
</Proxy>

Naturally, you’re going to want to change the host names, ports, etc. to be applicable for your environment. Note: IP addresses work just fine, too.

Third, restart Apache and verify that your “worker” servers are receiving and responding to the requests. (The Apache ‘error log’ has some nice output too if you’ve entered something wrong on configuration.)

That’s it. You’ve just set up a simple load balancing environment.

In my case, my web application servers are identical and my needs were fairly simple. However, mod_proxy_balancer includes options for weighting based on request counts, request byte sizes, and static factoring. It also has an optional module to support dynamic reconfiguring of the balanced machines as well. All in all, very recommended for development use.

Configuring MySQL to Run in EC2's EBS under a Fedora AMI

There’s a lot of buzz around Amazon’s recent announcement/availability of their EBS add-on for EC2. The main reason being that it gets around the deficiency of EC2 wherein you lose all your locally-stored data if someone or something crashes your EC2 instance.

Certainly, one of the main uses of EBS will be to persistently host a database server. There have been some tutorials as to how to set up MySQL to use an EBS share. One that I especially appreciated was Eric Hammond’s tutorial. Unfortunately, it didn’t work out-of-the-box for me when running a Fedora 8-based AMI rather than an Ubuntu-based AMI the tutorial uses.

Since I had to tweak a few things to get it to work, I thought I’d share the solution in case you run into the same problem. It’s also good to document it for when my memory fails me ;)

The following is an addendum to the above mentioned tutorial that I used to get MySQL working on an EBS share using Fedora 8. For brevity I refer you to the tutorial for all steps apart from the section titled “Configuring MySQL to use the EBS volume”. Specifically, I used the AMI “ami-2b5fba42″ image that is a Fedora 8 image that doesn’t include MySQL. I installed MySQL (5.x) on the system using yum.

In actuality, this probably applies more broadly than EC2 and rather describes some tweaks one would need to make to get MySQL running from a non-standard location on Fedora.

  1. First, stop the MySQL instance if it’s running: # /etc/init.d/mysql stop
  2. Next, create the paths for MySQL: # mkdir -p /vol/lib /vol/log
  3. Then mv /var/lib/mysql /vol/lib/ && mv /var/log/mysql /vol/log/
  4. Open your MySQL configuration file (mine was at /etc/my.cnf) and replace all instances of "/var/" with "/vol/" as /vol is used as the new MySQL home in the rest of the tutorial. (/vol is mounted against the EBS-backed device.)
  5. Open your MySQL startup script (mine was at /etc/init.d/mysqld) and replace all instances of "/var/" with "/vol/" as per the rest of the tutorial
  6. In the MySQL startup script, find the line that calls # /usr/bin/mysqladmin and add the 'socket file' command line option to point to our non-standard socket file ($socketfile): /usr/bin/mysqladmin -S$socketfile
  7. Next, start the MySQL instance running: # /etc/init.d/mysql start

Note that from this point you have to call most--if not all--MySQL command line utilities with the socket file argument. This file is specified in your MySQL configuration. If you forget, the command line will give you an error that should remind you. For example...

# mysql -uroot

...yields...

ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/lib/mysql/mysql.sock' (2)

The biggest change needed was to update the references in /etc/init.d/mysqld to point to /vol and add the explicit socket file argument.