Using Hive with Existing Files on S3

One feature that Hive gets for free by virtue of being layered atop Hadoop is the S3 file system implementation. The upshot is that all the raw, textual data you have stored in S3 is just a few hoops away from being queried with Hive’s SQL-esque language.

Imagine you have an S3 bucket un-originally named mys3bucket. It contains several really large gzipped files filled with very interesting data that you’d like to query. For the sake of simplicity for this post, let’s assume the data in each file is a simple key=value pairing, one per line.

If you don’t happen to have any data in S3 (or want to use a sample), let’s upload a very simple gzipped file with these values:

    jim=89347
    dave=313925
    noddy=21516
    don=6771
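
If you’d like to follow along, here’s a rough sketch of one way to create and upload that file. It assumes the AWS CLI is installed and configured with credentials for your account; the local file name is just a placeholder.

    # Write the sample key=value pairs, one per line, then gzip the file
    printf 'jim=89347\ndave=313925\nnoddy=21516\ndon=6771\n' > mydata.txt
    gzip mydata.txt

    # Copy the gzipped file to the top level of the bucket
    aws s3 cp mydata.txt.gz s3://mys3bucket/mydata.txt.gz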

Prerequisites

Both Hive and S3 have their own design requirements, which can be a little confusing when you start to use the two together. Let me outline a few things you need to be aware of before you combine them.

First, S3 doesn’t really support directories. Each bucket has a flat namespace of keys that map to chunks of data. However, some S3 tools will create zero-length dummy files that look a whole lot like directories (but really aren’t). It’s best if your data is all at the top level of the bucket and doesn’t try any trickery. There are ways to use these pseudo-directories to keep data separate, but let’s keep things simple for now.
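
If you’re curious, keeping data separate with one of those pseudo-directories mostly comes down to pointing a table’s LOCATION at the key prefix. Here’s a hedged sketch using a hypothetical logs/ prefix (the CREATE EXTERNAL TABLE syntax itself is covered below):

    -- Hypothetical: a table over only the objects under the logs/ prefix
    CREATE EXTERNAL TABLE mylogs (key STRING, value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
    LOCATION 's3n://mys3bucket/logs/';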

Second, ensure that the S3 bucket you want to use with Hive contains only homogeneously formatted files. Don’t mix a CSV file, an Apache log, and a tab-delimited file in the same bucket. We need to tell Hive the format of the data so that it knows what to expect when it reads our files.

Third, even though this tutorial never instructs you to do so, Hive allows you to overwrite your data. That means you could lose all your data in S3, so please be careful! If you need to, make a copy of the data into another S3 bucket for testing. Note that there is an existing Jira ticket to make external tables optionally read-only, but it’s not yet implemented.
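
One quick way to make that copy, assuming the AWS CLI and a scratch bucket whose name is entirely up to you:

    # mys3bucket-scratch is a hypothetical bucket; sync copies every object across
    aws s3 sync s3://mys3bucket s3://mys3bucket-scratch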

Install Hive

Of course, the first thing you have to do is to install Hive. This is fairly straightforward and perhaps my previous post on this topic can help out. I’m doing some development (bug fixes, etc.), so I’m running off of trunk. But a reasonably recent version should work fine.

Let’s assume you’ve defined an environment variable named HIVE_HOME that points to where you’ve installed Hive on your local machine. If you don’t really want to define an environment variable, just replace $HIVE_HOME with your installation directory in the remaining instructions.
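
For example, something along these lines in your shell profile would do it (the path is just a placeholder for wherever you unpacked Hive):

    # Point HIVE_HOME at your Hive installation and put its bin/ on the PATH
    export HIVE_HOME=/path/to/hive
    export PATH=$HIVE_HOME/bin:$PATH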

Update Configuration

Now, let’s change our configuration a bit so that we can access the S3 bucket with all our data. First, we need to include the following configuration. This can be done via HIVE_OPTS, configuration files ($HIVE_HOME/conf/hive-site.xml), or via Hive CLI’s SET command.

Here are the configuration parameters:

    Name                       Value
    fs.s3n.awsAccessKeyId      Your S3 access key
    fs.s3n.awsSecretAccessKey  Your S3 secret access key
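
For example, if you go the hive-site.xml route, the entries look something like this (the values shown are placeholders for your own credentials):

    <property>
      <name>fs.s3n.awsAccessKeyId</name>
      <value>YOUR_ACCESS_KEY_ID</value>
    </property>
    <property>
      <name>fs.s3n.awsSecretAccessKey</name>
      <value>YOUR_SECRET_ACCESS_KEY</value>
    </property>

The equivalent from the Hive CLI is a pair of SET commands:

    SET fs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY_ID;
    SET fs.s3n.awsSecretAccessKey=YOUR_SECRET_ACCESS_KEY;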

Create Table Over S3 Bucket

Whether you prefer the term veneer, façade, wrapper, or whatever, we need to tell Hive where to find our data and the format of the files. Let’s create a Hive table definition that references the data in S3:

    CREATE EXTERNAL TABLE mydata (key STRING, value INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '='
    LOCATION 's3n://mys3bucket/';

Note: don’t forget the trailing slash in the LOCATION clause!

Here we’ve created a Hive table named mydata that has two columns: a key and a value. The FIELDS TERMINATED BY clause tells Hive that the two columns are separated by the ‘=’ character in the data files. The LOCATION clause points to our external data in mys3bucket.
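
If you want to double-check the definition before running anything heavier, Hive’s usual introspection commands work here too:

    SHOW TABLES;
    DESCRIBE mydata;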

Now, we can query the data:

    SELECT * FROM mydata ORDER BY key;

The result would look something like this:

    Total MapReduce jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks determined at compile time: 1
    In order to change the average load for a reducer (in bytes):
      set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
      set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
      set mapred.reduce.tasks=<number>
    plan = file:/tmp/kirk/hive_2010-09-30_13-24-57_172_4142256212187799609/-local-10002/plan.xml
    Job running in-process (local Hadoop)
    2010-09-30 13:25:00,715 null map = 100%,  reduce = 100%
    Ended Job = job_local_0001
    OK
    dave	313925
    don	6771
    jim	89347
    noddy	21516
    Time taken: 3.742 seconds

Because we’re kicking off a MapReduce job to query the data, and because the data has to be pulled out of S3 to our local machine, the query is a bit slow. But at the scale at which you’d actually use Hive, you would probably want to move your processing to EC2/EMR for data locality.

Conclusion

Of course, there are many other ways that Hive and S3 can be combined. You may opt to use S3 as a place to store source data and tables with data generated by other tools. You can use S3 as Hive’s storage from within Amazon EC2 and Elastic MapReduce. You can use S3 as a starting point and pull the data into HDFS-based Hive tables. Hive presents a lot of possibilities, which can be daunting at first, but the positive spin is that one of these options is very likely to fit your particular needs.
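
As a taste of that last option, here is a minimal sketch of pulling the S3 data into an HDFS-backed (managed) table; the table name local_mydata is just an illustration:

    -- Managed table stored under Hive's warehouse directory in HDFS
    CREATE TABLE local_mydata (key STRING, value INT);

    -- Copy the rows out of S3 into the managed table
    INSERT OVERWRITE TABLE local_mydata
    SELECT key, value FROM mydata;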

References

See also the following pages from the official Hive wiki on Hive and S3 usage:

Overview of Using Hive with AWS
Overview of Using Hive with S3
Using Hive with Compressed Data Storage