big data, software engineering

Luigi and Google Cloud in production – retrospective

We’ve been running Luigi in production for 3 weeks now without any issues, so I thought it was time to share the code I wrote to link Luigi with Google Cloud (see the previous article). I have to warn you though: it’s a first iteration and far from perfect, as I thought I would have more time to clean it up. Still, it will give you an idea of how to start linking the two products. Anyway, fetch it here: github/luigiext-gcloud

The tasks concentrate on BigQuery because, for storage, we can get by with the Google Cloud Storage connector for HDFS. But we’re planning to add some GCS tasks this sprint to speed up some of the processes.


I also wanted to show you the CPU graph of our nightly tasks as shown in the cloud dashboard. Until now, most of the Hadoop MR tasks were written in Pig and started sequentially in a batch process. You can imagine that this is far from optimal. Writing disaster recovery in batch is almost impossible. You can see it clearly in the left part of the graph: the cluster is used optimally for only a small slice of about 45 minutes (by one big Pig script); the rest of the time most of the CPUs are idle because the tasks were not big enough.
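To make the recovery point a bit more concrete, here is a minimal sketch of the idea that a dependency-driven tool relies on: a step counts as done when its output exists, so a rerun only redoes what is missing. This is not Luigi's API and not our production code. The `Step` class, the `build` function and the step names are hypothetical, and local files stand in for HDFS or GCS outputs.

```python
import os
import tempfile


class Step:
    """A hypothetical pipeline step that counts as done once its output file exists."""

    def __init__(self, name, output_path, requires=()):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)

    def complete(self):
        return os.path.exists(self.output_path)

    def run(self):
        # Stand-in for the real work (a Pig script, a BigQuery load, ...).
        with open(self.output_path, "w") as handle:
            handle.write(self.name + "\n")


def build(step, executed):
    """Run upstream steps first and skip every step whose output already exists."""
    if step.complete():
        return
    for upstream in step.requires:
        build(upstream, executed)
    step.run()
    executed.append(step.name)


workdir = tempfile.mkdtemp()
extract = Step("extract", os.path.join(workdir, "extract.txt"))
aggregate = Step("aggregate", os.path.join(workdir, "aggregate.txt"), [extract])
export = Step("export", os.path.join(workdir, "export.txt"), [aggregate])

first_run = []
build(export, first_run)    # nothing exists yet, so every step runs

# Simulate a failure that lost only the final output, then simply rerun.
os.remove(export.output_path)
second_run = []
build(export, second_run)   # only the missing step is redone
```

In a sequential batch script, the same failure would mean either rerunning everything or hand-editing the script to restart halfway.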

But if you look at the right side of the graph, you see that the cluster is being used much better, reaching almost 100% CPU usage on the 32 nodes. This was achieved by using Luigi and setting multiple workers. For our relatively small cluster, 6 seems like the sweet spot. Granted, you see almost no CPU usage in the beginning, and that’s because building the complete dependency graph of all our new Luigi tasks is quite slow. This is mainly because checking whether an output file exists is done by starting the hadoop fs command-line utility for each output. That will be solved once I’ve written native support for Google Cloud Storage. The last dip is caused by all the BigQuery import tasks that run at the end.
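As an illustration of why that existence check is slow, here is a rough sketch of what shelling out per output looks like. It is not the code from the repository: the function names and the example paths are made up, and the `runner` parameter is only there so the sketch can be exercised without a Hadoop installation. The one real piece is the `hadoop fs -test -e <path>` command, which exits with status 0 when the path exists.

```python
import subprocess


def hdfs_exists(path, runner=subprocess.call):
    """Return True if `hadoop fs -test -e <path>` exits with status 0."""
    # Every call starts a new hadoop process (and with it a JVM),
    # which is what makes checking many outputs one by one so slow.
    return runner(["hadoop", "fs", "-test", "-e", path]) == 0


def check_outputs(paths, runner=subprocess.call):
    """Check each output separately: one external process per path."""
    return {path: hdfs_exists(path, runner) for path in paths}


# Fake runner for demonstration only: records the commands it receives and
# pretends that paths ending in "/done" exist.
calls = []


def fake_runner(argv):
    calls.append(argv)
    return 0 if argv[-1].endswith("/done") else 1


result = check_outputs(
    ["gs://example-bucket/a/done", "gs://example-bucket/b/pending"],
    runner=fake_runner,
)
```

With a few hundred outputs in the dependency graph, that is a few hundred process launches before any real work starts, which matches the flat start of the graph.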

I’m a lot happier since we adopted Luigi for running our data pipeline (the new part anyway), and if you’re looking for a tool for managing all your hundreds of Hadoop jobs, make sure to look at Luigi.