Saturday, October 9, 2010

How to unit test Apache Hive scripts, including those made for EMR

Apache hive has proven to be a useful tool for writing Hadoop Map/Reduce queries.  To do something a little more complex, requires some string manipulation.  If you're using Amazon's EMR version of Hive, it includes the ability to provide values for variables at run time.  But what if you aren't using EMR, or you are, but you want to test those scripts locally? 


GHive is a small Groovy wrapper around Hive. It provides a few basic functionalities. It can be particularly useful in the cases mentioned above.  I wrote this a couple of months ago, and I'm releasing it as open source now. The code is at http://github.com/mlimotte/GHive.

FEATURES


Variable Interpolation: The script can include variables of the form ${name} which are replaced by values from the vars Map. This is identical to the Amazon EMR extended functionality.  It's a erally useful feature, and makes sense that it should be available without EMR.  Ideally, it will be added to mainstream Hive, eventually.

Comments: If you have a long query, it is useful to include comments, just like you would in a long SQL query.  With Hive, you can do that for a script run with "hive -f", but not for queries that are executed through JDBC.  With GHive, you can include comments with a -- at the beginning of a line or whitespace followed by --.  You can not put a comment at the end of a line. That's because I didn't want to interfere with scripts that might include a string with -- in them. For example,
WHERE my_field = '--'
...would be a problem.  It would be nice to fix this with a smarter regex or parser, but it wasn't worth the trouble for me.

Dump processed script to a file: It can be useful to save a copy of the script, after the GHive pre-processor has worked on it. A simple use case is a complicated hive job that runs once a day. If you dump the script to a log directory, you can refer to it later if there are errors and even run the individual commands interactively through the Hive shell. There is a simple GHive API call to do this (see below).

Multiple commands through JDBC: The commands are sent through the Hive JDBC driver. If you have a set of distinct queries in a text file and try to feed them all to the JDBC driver at once, you'll notice that JDBC only accepts one command at a time.  So GHive separates the script into multiple commands, which are fed to JDBC serially. Commands are terminated by ";" or EOF.


BUILDING


  1. Install Apache Ant (I used version 1.7.1)
  2. Adjust any paths in build.xml
  3. Run:
    ant jar
  4. The resulting jar will be at ./ghive.jar


    USAGE


    Example 1 (simple script)
    Create a hive script with variables and comments. For example, create a file "hive/simple.ghive":

    -- My simple hive script
    -- For simplicity, I'm assuming the tables already exist

    ADD JAR ${HIVE_LIB}/hive_contrib.jar;

    ALTER TABLE user_ex ADD PARTITION (dt='${DATE}');

    INSERT OVERWRITE TABLE tmp
    SELECT username,
    -- first_name,
    -- last_name,
    email,
    phone
    FROM user_ex;

    In your Groovy Code

    GHive ghive = GHive.instance()
    // vars are a Map, the keys are case-sensitive. Remember, in
    // Groovy, symbols used as keys in a map don't need to be quoted.
    // I.e. [ FOO : 'foo' ] is equivalent to [ 'FOO' : foo ]
    def vars = [
                       HIVE_LIB : '/usr/lib/hive/lib',

                       DATE : '2010-07-02'
                      
    ]

    // The use of dumpScript is optional, and just writes a copy of the GHive
    // processed hive commands to disk.  The resulting file could be fed
    // directly to the hive shell via the -f flag.
    ghive.dumpScript("hive/simple.ghive",vars,"output/simple.hive")
    ghive.executeScript("hive/simple.ghive",vars)


    Example 2
    Run some query on each for a list of names and process the result in hive.  For example, a script "hive/getdata.ghive":
    -- A user may be in multiple groups
    SELECT username, group
    FROM user_group
    WHERE username = ${USERNAME}
    And groovy code:
    GHive ghive = GHive.instance()
    def usernames = [ 'gilbert', 'brook', 'xtreme' ]
    usernames.each { username ->
       ghive.eachRow("hive/getdata.ghive", [ USERNAME : username ]) { rs ->
         def group = rs.getString(2) // Like standard java sql, a 1 based index number
         println "$username is a member of $group"
       }
    }

    UNIT TESTING

    See "test/ghive/TestSimple.groovy". The main testcase method from that example is here:
    @Test
    public void testSimple() {

       // the q path is relative to the classpath
       def q = "simple.hive"

       def queries = ghive.parseScript (q, [ STORAGE_TYPE: 'TEXTFILE' ])

       ghive.execute(queries[0]) // create table
       ghive.execute(queries[1]) // load data
       def result = ghive.executeAndGetList(
                              "select id, value, amt from simple",
                              [ 'id', 'value', 'amt' ])

      assertEquals(3,result.size())

      def expected = [
            [ id:'1', value:'line1', amt:'0.2' ],
            [ id:'100', value:'line2', amt:'0.3' ],
            [ id:'50', value:'line3', amt:'0.4' ]]

      assertEquals(expected,result)
    }

    THOUGHTS ON THE DSL

    This is a simple, external DSL. Conceivably, I could use a full parser instead of the simple REGEXs and expand this into a full DSL with conditionals and loops and so on. But, as Example 2 shows, you can just use groovy to do this.

    If that's a typical use case, it might make more sense to create an internal DSL.

    0 comments: