Give Your MySQL Account Access to Hive

One area of the Apache Hive documentation that’s not entirely explicit is in regard to the database privileges needed for its metastore[1]. Developers often become accustomed to creating a database account that has all privileges granted. But in the Real World, end users of Hive must configure it to point to a metastore RDBMS account with very specific privileges granted by the DBA.

Are You Privileged?

By default, Hive is configured to use a (local-to-that-system) Apache Derby database instance as its metastore. However, once an initial evaluation phase of Hive is complete, the team will likely need to move to a shared database to use as their metastore so the whole team is in sync. MySQL is a suitable database to be used as the metastore, though Hive supports other relational databases as well.

So which database roles must be granted to the account used for the Hive metastore in MySQL?

  • CREATE
  • DROP
  • ALTER
  • INDEX
  • SELECT
  • INSERT
  • UPDATE
  • DELETE

You can refer to the official MySQL documentation for granting privileges regarding the specifics of the roles and the exact syntax.

Here’s an example MySQL script that creates a dedicated user (“hive”), a MySQL database (“hivemetastore”), and grants the user some DDL[2] and DML privileges:


CREATE DATABASE hivemetastore;
CREATE USER 'hive'@'%' IDENTIFIED BY 'mypassword';
GRANT CREATE, DROP, ALTER, INDEX ON hivemetastore.* TO 'hive'@'%';
GRANT SELECT, INSERT, UPDATE, DELETE ON hivemetastore.* TO 'hive'@'%';

Once the DBA has created the account and granted it the requisite privileges, the user running Hive can specify this MySQL database as his Hive metastore. This is done either by exporting the following settings via HIVE_OPTS or by specifying them on the command line when invoking the CLI:


javax.jdo.option.ConnectionURL=jdbc:mysql://your.mysql.server/hivemetastore
javax.jdo.option.ConnectionUserName=hive
javax.jdo.option.ConnectionPassword=mypassword

[1] It can be confusing when discussing Hive someone refers to the “Hive” database, tables, or columns. To which are they referring? To the database, tables, and columns that Hive presents (with data stored in HDFS, S3, etc.) or the metastore database, tables, and columns (with data stored in MySQL, Postgres, etc.) in which Hive stores its meta data? Unfortunately the two are often used interchangeably.

[2] The reason that the CREATE, DROP, ALTER, and INDEX privileges are needed is because Hive can dynamically create the needed schema in the metastore database in which to store information about the Hive database.

Using Hive with Existing Files on S3

One feature that Hive gets for free by virtue of being layered atop Hadoop is the S3 file system implementation. The upshot being that all the raw, textual data you have stored in S3 is just a few hoops away from being queried using Hive’s SQL-esque language.

Imagine you have an S3 bucket un-originally named mys3bucket. It contains several really large gzipped files filled with very interesting data that you’d like to query. For the sake of simplicity for this post, let’s assume the data in each file is a simple key=value pairing, one per line.

If you don’t happen to have any data in S3 (or want to use a sample), let’s upload a very simple gzipped file with these values:

    jim=89347
    dave=313925
    noddy=21516
    don=6771

Prerequisites

Both Hive and S3 have their own design requirements which can be a little confusing when you start to use the two together. Let me outline a few things that you need to be aware of before you attempt to mix them together.

First, S3 doesn’t really support directories. Each bucket has a flat namespace of keys that map to chunks of data. However, some S3 tools will create zero-length dummy files that look a whole lot like directories (but really aren’t). It’s best if your data is all at the top level of the bucket and doesn’t try any trickery. There are ways to use these pseudo-directories to keep data separate, but let’s keep things simple for now.

Second, ensure that the S3 bucket that you want to use with Hive only includes homogeneously-formatted files. Don’t include a CSV file, Apache log, and tab-delimited file in the same bucket. We need to tell Hive the format of the data so that when it reads our data it knows what to expect.

Third, even though this tutorial doesn’t instruct you to do this, Hive allows you to overwrite your data. This could mean you might lose all your data in S3 – so please be careful! If you need to, make a copy of the data into another S3 bucket for testing. Note that there is an existing Jira ticket to make external tables optionally read only, but it’s not yet implemented.

Install Hive

Of course, the first thing you have to do is to install Hive. This is fairly straightforward and perhaps my previous post on this topic can help out. I’m doing some development (bug fixes, etc.), so I’m running off of trunk. But a reasonably recent version should work fine.

Let’s assume you’ve defined an environment variable named HIVE_HOME that points to where you’ve installed Hive on your local machine. If you don’t really want to define an environment variable, just replace $HIVE_OPTS with your installation directory in the remaining instructions.

Update Configuration

Now, let’s change our configuration a bit so that we can access the S3 bucket with all our data. First, we need to include the following configuration. This can be done via HIVE_OPTS, configuration files ($HIVE_HOME/conf/hive-site.xml), or via Hive CLI’s SET command.

Here are the configuration parameters:

Name Value
fs.s3n.awsAccessKeyId Your S3 access key
fs.s3n.awsSecretAccessKey Your S3 secret access key

Create Table Over S3 Bucket

Whether you prefer the term veneer, façade, wrapper, or whatever, we need to tell Hive where to find our data and the format of the files. Let’s create a Hive table definition that references the data in S3:

    CREATE EXTERNAL TABLE mydata (key STRING, value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
    LOCATION 's3n://mys3bucket/';

Note: don’t forget the trailing slash in the LOCATION clause!

Here we’ve created a Hive table named mydata that has two columns: a key and a value. The FIELDS TERMINATED clause tells Hive that the two columns are separated by the ‘=’ character in the data files. The LOCATION clause points to our external data in mys3bucket.

Now, we can query the data:

    SELECT * FROM mydata ORDER BY key;

The result would look something like this:

    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=
    plan = file:/tmp/kirk/hive_2010-09-30_13-24-57_172_4142256212187799609/-local-10002/plan.xml
    Job running in-process (local Hadoop)
    2010-09-30 13:25:00,715 null map = 100%,  reduce = 100%
    Ended Job = job_local_0001
    OK
    dave	313925
    don	6771
    jim	89347
    noddy	21516
    Time taken: 3.742 seconds

Because we’re kicking off a map-reduce job to query the data and because the data is being pulled out of S3 to our local machine, it’s a bit slow. But at the scale at which you’d use Hive, you would probably want to move your processing to EC2/EMR for data locality.

Conclusion

Of course, there are many other ways that Hive and S3 can be combined. You may opt to use S3 as a place to store source data and tables with data generated by other tools. You can use S3 as a Hive storage from within Amazon’s EC2 and Elastic MapReduce. You can use S3 as a starting point and pull the data into HDFS-based Hive tables. Hive presents a lot of possibilities — which can be daunting at first — but the positive spin is that these options are very likely to coincide with your unique needs.

References

Please see also the following links for Hive and S3 usage from the official Hive wiki:

Overview of Using Hive with AWS
Overview of Using Hive with S3
Using Hive with Compressed Data Storage

How To Try Out Hive on Your Local Machine — And Not Upset Your Ops Team

According to the Hive web site:

Hive is a data warehouse infrastructure built on top of Hadoop that provides tools to enable easy data summarization, adhoc querying and analysis of large datasets data stored in Hadoop files.

Hive is built on top of various technologies, the most notable being Hadoop and HDFS. As a result, to run Hive, you need access to a Hadoop cluster running a job tracker, task trackers, DFS nodes, and so on. You will also need an external database (MySQL, PostgreSQL, etc.) to store Hive’s meta data.

Even if you have such an environment available to you, for exploratory reasons it may be faster and less error prone to work in a sandbox. It’s very easy to seemingly delete files from HDFS using common Hive commands (loading data from HDFS will by default move the file to the Hive-managed location). While you’re finding your way around Hive, it might be best to isolate yourself from the outside world (metaphorically, mind you) to avoid causing any problems.

So then, how can you try out Hive on your local machine, and in a safe sandbox? Well, it’s actually pretty easy.

First, you’ll need to download and build Hive from the source.

Next, before running Hive, simply export the following environment variable:

export HIVE_OPTS="-hiveconf mapred.job.tracker=local \
   -hiveconf fs.default.name=file://`pwd`/tmp \
   -hiveconf hive.metastore.warehouse.dir=file://`pwd`/tmp/warehouse \
   -hiveconf javax.jdo.option.ConnectionURL=jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true"

(Sorry about the WordPress formatting; watch out for the last line in the above…)

The HIVE_OPTS environment variable is used by the Hive command-line utility to provide overrides to the default Hive configuration. (Note: some references use the incorrect name HIVE_OPT (i.e. missing the S) which, of course, causes the values to be silently ignored.) Setting it prior to running the bin/hive script will cause the values set in the environment variable to be used instead.

Just for completeness, let’s review the settings:

  • mapred.job.tracker – This is a standard Hadoop configuration option to point to the URL of the job tracker. The magic value of “local” will cause Hadoop to be run on the local machine instead.
  • fs.default.name – This is another standard Hadoop configuration option to specify the root of the distributed file system used by Hadoop (often HDFS, but not necessarily). By using a file://-based URL, we use our local file system, which–for tests–is likely sufficient.
  • hive.metastore.warehouse.dir – This setting is specific to Hive, and is the directory name (relative to the fs.default.name) in which Hive’s warehouse data is stored. Again, we use a local file system-specific path.
  • javax.jdo.option.ConnectionURL – This is a standard JDBC URL used by Hive to connect to its meta data store. Using the value of jdbc:derby:;databaseName=`pwd`/tmp/metastore_db;create=true" allows us to use a local, embedded Derby database with its files stored on the local file system.

At this point, you should be ready to run Hive. Notice that as you create, load, and query from your database, the directory under the current directory (named “tmp”) is populated with a number of files. And guess what? If you want to start all over from scratch, simply exit Hive, delete that directory, and everything is new again.

Also of value in your travels is to set the logging level from the command line:

export HIVE_OPTS="-hiveconf hive.root.logger=DEBUG,console"

You can adjust the logging level to what you need. Another great benefit of running Hadoop locally is that you can get the debug logging to your console for both the Hive client and map/reduce execution.