Monday, December 27, 2010

Which operation def macro should I use in Cascalog?

I've been using the Cascalog query language for Hadoop map/reduce jobs for a while. Part of the learning curve is coming up to speed on its rich set of operation creation macros (essentially, various techniques for creating user-defined functions). To help with that, I've put together the following chart describing the different operation types in Cascalog. The idea is to provide some guidance on which type is appropriate for a particular job, along with examples and notes on usage and performance.

Cascalog jobs/queries are written in Clojure (a Lisp-like language that runs on the Java VM), so all examples below are Clojure code.



Let me know if you see any mistakes, or if you have suggestions for further details that could be added to make the chart more useful.

For reference, this thread also has a description of the def macros from Nathan, and more examples.

Saturday, October 9, 2010

How to unit test Apache Hive scripts, including those made for EMR

Apache Hive has proven to be a useful tool for writing Hadoop Map/Reduce queries. Doing something a little more complex, though, requires some string manipulation. If you're using Amazon's EMR version of Hive, it includes the ability to provide values for variables at run time. But what if you aren't using EMR, or you are but you want to test those scripts locally?


GHive is a small Groovy wrapper around Hive. It provides a few basic functionalities. It can be particularly useful in the cases mentioned above.  I wrote this a couple of months ago, and I'm releasing it as open source now. The code is at http://github.com/mlimotte/GHive.

FEATURES


Variable Interpolation: The script can include variables of the form ${name}, which are replaced by values from the vars Map. This is identical to the Amazon EMR extended functionality. It's a generally useful feature, and it makes sense that it should be available without EMR. Ideally, it will eventually be added to mainstream Hive.
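
To make the substitution rule concrete, here is a minimal sketch in Python. It is not GHive's actual Groovy implementation; the function name interpolate and the choice to raise an error for a missing variable are assumptions made for illustration.

```python
import re

# Matches ${name}, where name is made of letters, digits and underscores.
_VAR = re.compile(r"\$\{(\w+)\}")


def interpolate(script, variables):
    """Replace each ${name} in script with variables[name] (case-sensitive)."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            # Assumption: fail loudly rather than leave the placeholder in place.
            raise KeyError(f"no value supplied for ${{{name}}}")
        return str(variables[name])

    return _VAR.sub(lookup, script)


print(interpolate("ALTER TABLE user_ex ADD PARTITION (dt='${DATE}');",
                  {"DATE": "2010-07-02"}))
```

The substitution is purely textual, so any quoting that the query needs has to be written around the placeholder in the script itself.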

Comments: If you have a long query, it is useful to include comments, just like you would in a long SQL query.  With Hive, you can do that for a script run with "hive -f", but not for queries that are executed through JDBC.  With GHive, you can include comments with a -- at the beginning of a line, or after leading whitespace.  You cannot put a comment at the end of a line. That's because I didn't want to interfere with scripts that might include a string with -- in it. For example,
WHERE my_field = '--'
...would be a problem.  It would be nice to fix this with a smarter regex or parser, but it wasn't worth the trouble for me.
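
The comment rule described above (drop a line only when -- is the first non-whitespace text on it) can be sketched in a few lines of Python. This illustrates the rule and is not GHive's own code; the name strip_comment_lines is hypothetical.

```python
def strip_comment_lines(script):
    """Remove whole-line comments: lines whose first non-blank text is '--'.

    A '--' appearing later in a line is left alone, so string literals
    such as '--' inside a WHERE clause survive untouched.
    """
    kept = [line for line in script.splitlines()
            if not line.lstrip().startswith("--")]
    return "\n".join(kept)


example = """-- My simple hive script
SELECT username,
  -- first_name,
  email
FROM user_ex
WHERE my_field = '--'"""

print(strip_comment_lines(example))
```

Checking only the start of each line avoids having to parse string literals, at the cost of not supporting end-of-line comments.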

Dump processed script to a file: It can be useful to save a copy of the script, after the GHive pre-processor has worked on it. A simple use case is a complicated hive job that runs once a day. If you dump the script to a log directory, you can refer to it later if there are errors and even run the individual commands interactively through the Hive shell. There is a simple GHive API call to do this (see below).

Multiple commands through JDBC: The commands are sent through the Hive JDBC driver. If you have a set of distinct queries in a text file and try to feed them all to the JDBC driver at once, you'll notice that JDBC only accepts one command at a time.  So GHive separates the script into multiple commands, which are fed to JDBC serially. Commands are terminated by ";" or EOF.
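
As a rough illustration of that splitting step, here is a Python sketch. It assumes the simplest possible rule (split on every ";" and treat whatever follows the last ";" as a final command) and would mis-split a script with a semicolon inside a string literal. The names split_commands and run_all, and the execute callable standing in for the JDBC call, are hypothetical.

```python
def split_commands(script):
    """Split a script into commands terminated by ';' or end of input."""
    return [cmd.strip() for cmd in script.split(";") if cmd.strip()]


def run_all(script, execute):
    """Feed each command to execute, one at a time and in order.

    execute is a stand-in for whatever sends a single statement to Hive.
    """
    for command in split_commands(script):
        execute(command)


script = """ADD JAR /usr/lib/hive/lib/hive_contrib.jar;

ALTER TABLE user_ex ADD PARTITION (dt='2010-07-02');
SELECT username FROM user_ex"""

run_all(script, print)
```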


BUILDING


  1. Install Apache Ant (I used version 1.7.1)
  2. Adjust any paths in build.xml
  3. Run:
    ant jar
  4. The resulting jar will be at ./ghive.jar


    USAGE


    Example 1 (simple script)
    Create a hive script with variables and comments. For example, create a file "hive/simple.ghive":

    -- My simple hive script
    -- For simplicity, I'm assuming the tables already exist

    ADD JAR ${HIVE_LIB}/hive_contrib.jar;

    ALTER TABLE user_ex ADD PARTITION (dt='${DATE}');

    INSERT OVERWRITE TABLE tmp
    SELECT username,
    -- first_name,
    -- last_name,
    email,
    phone
    FROM user_ex;

    In your Groovy Code

    GHive ghive = GHive.instance()
    // vars are a Map, the keys are case-sensitive. Remember, in
    // Groovy, symbols used as keys in a map don't need to be quoted.
    // I.e. [ FOO : 'foo' ] is equivalent to [ 'FOO' : 'foo' ]
    def vars = [
        HIVE_LIB : '/usr/lib/hive/lib',
        DATE     : '2010-07-02'
    ]

    // The use of dumpScript is optional, and just writes a copy of the GHive
    // processed hive commands to disk.  The resulting file could be fed
    // directly to the hive shell via the -f flag.
    ghive.dumpScript("hive/simple.ghive",vars,"output/simple.hive")
    ghive.executeScript("hive/simple.ghive",vars)


    Example 2
    Run a query for each name in a list and process each result row in Groovy.  For example, a script "hive/getdata.ghive":
    -- A user may be in multiple groups
    SELECT username, group
    FROM user_group
    WHERE username = '${USERNAME}'
    And the Groovy code:
    GHive ghive = GHive.instance()
    def usernames = [ 'gilbert', 'brook', 'xtreme' ]
    usernames.each { username ->
       ghive.eachRow("hive/getdata.ghive", [ USERNAME : username ]) { rs ->
         def group = rs.getString(2) // as in java.sql, column indexes are 1-based
         println "$username is a member of $group"
       }
    }

    UNIT TESTING

    See "test/ghive/TestSimple.groovy". The main testcase method from that example is here:
    @Test
    public void testSimple() {

       // the q path is relative to the classpath
       def q = "simple.hive"

       def queries = ghive.parseScript (q, [ STORAGE_TYPE: 'TEXTFILE' ])

       ghive.execute(queries[0]) // create table
       ghive.execute(queries[1]) // load data
       def result = ghive.executeAndGetList(
                              "select id, value, amt from simple",
                              [ 'id', 'value', 'amt' ])

      assertEquals(3,result.size())

      def expected = [
            [ id:'1', value:'line1', amt:'0.2' ],
            [ id:'100', value:'line2', amt:'0.3' ],
            [ id:'50', value:'line3', amt:'0.4' ]]

      assertEquals(expected,result)
    }

    THOUGHTS ON THE DSL

    This is a simple, external DSL. Conceivably, I could use a full parser instead of the simple regexes and expand this into a full DSL with conditionals, loops and so on. But, as Example 2 shows, you can just use Groovy to do this.

    If that's a typical use case, it might make more sense to create an internal DSL.

    Wednesday, September 22, 2010

    How to move files from HDFS into Amazon S3 - A shell script wrapper for distcp

    A common need for users of Hadoop and Amazon Web Services is to move whole directories of data from an HDFS cluster (whether inside or outside EC2) into Amazon S3.

    The traditional solution is distcp.  Amazon offers an improved version, S3DistCp. A common idiom is to have a script run continuously, looking for complete directories that are ready to be moved, and to create "marker" files indicating that the data push is in progress or completed.  The marker files are a way of storing metadata about the data directories.  You could store this in SimpleDB, your own SQL database, a text file or anywhere else.  I like having it available right alongside the data.

    I thought this was a general enough pattern that I would share a bash script to do it. The code contains comments about what it's doing and why, along with instructions on installation and configuration (as constants).



    Basically, it looks for directories in HDFS matching a certain pattern and moves them to S3 using Amazon's new distcp replacement, S3DistCp.

    It creates marker files (_directory_.done and _directory_.processing) at the S3 destination, so that it can synchronize when multiple instances of the script are running, and so that downstream processes will know when the data directory is ready for consumption. This wouldn't be necessary if we were moving a single file, since it wouldn't show up at the destination until it was complete. But it is necessary when moving directories of files, where some files might be completely transferred, but not all.
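
The marker-file protocol can be sketched as follows. This is a Python illustration that uses a local directory as a stand-in for the S3 destination; it is not the bash script itself. The function names are hypothetical, and removing the .processing marker once .done is written is an assumption. The exclusive create used here is atomic on a local filesystem, which a real S3 destination would not necessarily guarantee.

```python
import tempfile
from pathlib import Path


def try_claim(dest_root, name):
    """Claim a directory by creating <name>.processing.

    Returns False if the directory is already done or another worker
    holds the claim.
    """
    if (dest_root / f"{name}.done").exists():
        return False
    try:
        # exist_ok=False makes the create exclusive on a local filesystem.
        (dest_root / f"{name}.processing").touch(exist_ok=False)
    except FileExistsError:
        return False
    return True


def mark_done(dest_root, name):
    """Publish <name>.done, then drop the in-progress marker (an assumption)."""
    (dest_root / f"{name}.done").touch()
    (dest_root / f"{name}.processing").unlink()


def is_ready(dest_root, name):
    """Downstream consumers read the directory only once .done exists."""
    return (dest_root / f"{name}.done").exists()


dest = Path(tempfile.mkdtemp())      # stand-in for the S3 destination
if try_claim(dest, "logs-2010-09-22"):
    # ... copy the directory here (the real script runs S3DistCp) ...
    mark_done(dest, "logs-2010-09-22")
print(is_ready(dest, "logs-2010-09-22"))
```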

    Friday, September 10, 2010

    Why I switched from iPhone to Android HTC Evo

    The Android HTC Evo is a great phone. I've been talking about it for a while, but I finally switched from my iPhone 4 on AT&T to the HTC Evo on Sprint.

    Here are my reasons:

    1. Evo has a larger screen (meaning bigger buttons, easier to view), but not so big that it doesn't fit comfortably in my pocket.
    2. The Google Maps app is better on Evo. I can layer multiple search results on the same map (e.g. Show me Post Offices, Best Buys and directions to some location all at the same time). And the turn-by-turn navigation (including voice) is a nice kicker.
    3. Mobile WiFi hot spot. I can connect my laptop or iPad over WiFi through the Evo to the Internet (no tether required)
    4. Better call quality... I frequently miss my calls on AT&T

    The iPhone has some nice features, but in the end these are less important to me and my lifestyle:
    1. Great battery life (with the Evo, I'm going to have to keep a spare charger at work and in the car)
    2. High resolution (I'm just more interested in size, than resolution)
    3. More attractive looking phone (guess you can't have everything)
    4. More apps (but Android still has the basics... and who really needs 50 different ToDo list apps anyway)
    Finally... don't get me wrong. I still love Apple. And I love my MacBook Pro and iPad.

    Friday, April 4, 2008

    How To Start Synergyc Automatically on Mac OS X 10.5

    Jan Varwig has a good post on using launchd to start the Synergy server (synergys).

    If you're like me, though, you have a desktop PC with keyboard and mouse and a MacBook. So, what you really need is instructions for running the client (synergyc). With a little playing around, I came up with this variation of Jan's instructions.




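    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
    <plist version="1.0">
    <dict>
        <key>Label</key>
        <string>net.sourceforge.syngery2</string>
        <key>OnDemand</key>
        <false/>
        <key>ProgramArguments</key>
        <array>
            <string>/usr/local/synergy-1.3.1/synergyc</string>
            <string>-f</string>
            <string>-1</string>
            <string>-d</string>
            <string>WARNING</string>
            <string>100.1.2.3</string>
        </array>
        <key>RunAtLoad</key>
        <true/>
        <key>ServiceDescription</key>
        <string>Synergy Client</string>
    </dict>
    </plist>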


    And, of course, you need to modify line 11 with the correct path for synergyc. And also line 16 with your IP address (I'm not sure if you can use a hostname instead of an IP address). Save this in /Library/LaunchAgents/net.sourceforge.synergy2.plist
    Owner should be root and permissions 644.

    I switched from the long options (e.g. "--no-daemon") to the short options (e.g. "-f"), because I was seeing an error that synergyc did not recognize option "--no-daemon".

    There is a comment from Cody Robbins about how to manually start the launchd process once you're all set up. That was helpful, but I needed the full filename for launchd to recognize it. I.e.:
    launchctl load /Library/LaunchAgents/net.sourceforge.synergy2.plist

    Cody's comment had the same command without ".plist".

    Friday, March 28, 2008

    How to move an Amazon AMI (EC2 image) from one account to another

    I've been having a lot of fun lately with Amazon EC2 and S3. I created a good base instance with many of the tools we needed, but it was all under my personal EC2/S3 account. In the short term, I can go ahead and submit expense reports to cover the charges, but I don't want to do that forever.

    To make a long story short, I attempted to download the image and then upload it under a corporate S3 account. Luckily, after a few web searches, I came across this post on transferring an AMI. This saved me a lot of time... without it, I might have tried several other solutions.

    In any case, I went with option 1 from the entry linked above. But they left out one crucial point. I thought it would be sufficient to start the AMI under my personal account, but bundle it using the corporate keys. I did all that, but when I tried to start the EC2 image with my corporate credentials, it failed... basically saying that it wasn't my image. It turns out that you need to share the image with the new EC2 account, and then launch and bundle the image using that new account.

    After that it worked like a charm.

    Tuesday, March 4, 2008

    Google Site was JotSpot

    Google launched the reworked JotSpot under Google Apps a few weeks ago. I've been using it successfully, so it's a welcome addition to the toolbox.

    One thing I miss, though, is the wiki syntax. The GUI features are nice, but wiki syntax is a powerful shorthand for advanced users. Wouldn't it be possible to support both? There is a button to switch to HTML mode, so why not a button to enable "wiki syntax" editing features?