mathjax

Monday, September 9, 2013

Get Pig LogicalPlan

Recently I've been wanting to get ahold of the logical plan (a graph representation) for a pig script without running it. The largest reason is that the logical plan is a fairly language and platform agnostic representation of a dataflow. Once you have the logical plan I can think of several fun things you could do with it:


  • Serialize it as JSON and send it to any number of arbitrary tools
  • Visualize it in a web browser
  • Edit it with a web app
  • Compile it into an execution (physical) plan for arbitrary (non-hadoop map-reduce) backend frameworks that make sense (storm, s4, spark) 
Ok, so maybe those are the fun things I actually plan on doing with it, but what's the difference?

Problem

Pig doesn't make it easy to get this. After spending several hours digging through the way pig parses and runs a pig script I've come away somewhat shaken up. The parsing logic is deeply coupled with the execution logic. Yes, yes, this is supposed to change as we go forward, eg PIG-3419, but what about in the mean time?

Hack/Solution

So, I've written this little jruby script to return the LogicalPlan for a pig script. Right now all it does is exactly the same as putting an 'EXPLAIN' operator in your script. However, since it exposes the LogicalPlan, you could easily extend this to do whatever you like with it.