GHive is a small Groovy wrapper around Hive. It provides a few basic functionalities. It can be particularly useful in the cases mentioned above. I wrote this a couple of months ago, and I'm releasing it as open source now. The code is at http://github.com/mlimotte/GHive.
FEATURES
Variable Interpolation: The script can include variables of the form ${name} which are replaced by values from the vars Map. This is identical to the Amazon EMR extended functionality. It's a erally useful feature, and makes sense that it should be available without EMR. Ideally, it will be added to mainstream Hive, eventually.
Comments: If you have a long query, it is useful to include comments, just like you would in a long SQL query. With Hive, you can do that for a script run with "hive -f", but not for queries that are executed through JDBC. With GHive, you can include comments with a -- at the beginning of a line or whitespace followed by --. You can not put a comment at the end of a line. That's because I didn't want to interfere with scripts that might include a string with -- in them. For example,
WHERE my_field = '--'...would be a problem. It would be nice to fix this with a smarter regex or parser, but it wasn't worth the trouble for me.
Dump processed script to a file: It can be useful to save a copy of the script, after the GHive pre-processor has worked on it. A simple use case is a complicated hive job that runs once a day. If you dump the script to a log directory, you can refer to it later if there are errors and even run the individual commands interactively through the Hive shell. There is a simple GHive API call to do this (see below).
Multiple commands through JDBC: The commands are sent through the Hive JDBC driver. If you have a set of distinct queries in a text file and try to feed them all to the JDBC driver at once, you'll notice that JDBC only accepts one command at a time. So GHive separates the script into multiple commands, which are fed to JDBC serially. Commands are terminated by ";" or EOF.
BUILDING
- Install Apache Ant (I used version 1.7.1)
- Adjust any paths in build.xml
- Run:
ant jar - The resulting jar will be at ./ghive.jar
USAGE
Example 1 (simple script)
Create a hive script with variables and comments. For example, create a file "hive/simple.ghive":-- My simple hive script
-- For simplicity, I'm assuming the tables already exist
ADD JAR ${HIVE_LIB}/hive_contrib.jar;
ALTER TABLE user_ex ADD PARTITION (dt='${DATE}');
INSERT OVERWRITE TABLE tmp
SELECT username,
-- first_name,
-- last_name,
email,
phone
FROM user_ex;
In your Groovy Code
GHive ghive = GHive.instance()
// vars are a Map, the keys are case-sensitive. Remember, in
// Groovy, symbols used as keys in a map don't need to be quoted.
// I.e. [ FOO : 'foo' ] is equivalent to [ 'FOO' : foo ]
def vars = [
HIVE_LIB : '/usr/lib/hive/lib',
DATE : '2010-07-02'
]
// The use of dumpScript is optional, and just writes a copy of the GHive
// processed hive commands to disk. The resulting file could be fed
// directly to the hive shell via the -f flag.
ghive.dumpScript("hive/simple.ghive",vars,"output/simple.hive")
ghive.executeScript("hive/simple.ghive",vars)
Example 2
Run some query on each for a list of names and process the result in hive. For example, a script "hive/getdata.ghive":-- A user may be in multiple groupsAnd groovy code:
SELECT username, group
FROM user_group
WHERE username = ${USERNAME}
GHive ghive = GHive.instance()
def usernames = [ 'gilbert', 'brook', 'xtreme' ]
usernames.each { username ->
ghive.eachRow("hive/getdata.ghive", [ USERNAME : username ]) { rs ->
def group = rs.getString(2) // Like standard java sql, a 1 based index number
println "$username is a member of $group"
}
}
UNIT TESTING
See "test/ghive/TestSimple.groovy". The main testcase method from that example is here:@Test
public void testSimple() {
// the q path is relative to the classpath
def q = "simple.hive"
def queries = ghive.parseScript (q, [ STORAGE_TYPE: 'TEXTFILE' ])
ghive.execute(queries[0]) // create table
ghive.execute(queries[1]) // load data
def result = ghive.executeAndGetList(
"select id, value, amt from simple",
[ 'id', 'value', 'amt' ])
assertEquals(3,result.size())
def expected = [
[ id:'1', value:'line1', amt:'0.2' ],
[ id:'100', value:'line2', amt:'0.3' ],
[ id:'50', value:'line3', amt:'0.4' ]]
assertEquals(expected,result)
}
THOUGHTS ON THE DSL
This is a simple, external DSL. Conceivably, I could use a full parser instead of the simple REGEXs and expand this into a full DSL with conditionals and loops and so on. But, as Example 2 shows, you can just use groovy to do this.If that's a typical use case, it might make more sense to create an internal DSL.
0 comments:
Post a Comment