PinTail - Tail and Pin Event Streams from Apache Hadoop Clusters

Posted on July 24, 2013
By Inderbir Singh

Background/Motivation

At InMobi Conduit is used extensively to move data in a streaming fashion from applications to grid. The appetite for consuming data is growing within the company as we deliver significant scale (more data aids in creating smarter and richer user experiences).

Previously most consumers used to leverage Apache Falcon for consuming data with jobs getting triggered at pre-defined times and have capabilities to gate on data arrival at grid at predefined locations.

Then arose use-cases which require streaming data consumption; specifically at InMobi we had teams like fraud detection, real time feedback to ad serving who wanted to consume events as they arrive on GRID at sub-minute level latencies directly into the application without having to worry about the nuances of using Hadoop and PINTAILwas born.

Features

  • Streaming view of data for a stream as it arrives, sourcing it from multiple clusters a.k.a tailing a stream.
  • Checkpoint a stream a.k.a pinning a stream
  • Partition a stream across multiple consumers within a group for better throughput.
  • Custom input formats.
  • Decoupled from Conduit. Seamless support for Conduit Streams and any other event stream generated on Hadoop.

10000FT view

Pintail comes packed as a asynchronous java library with an iterator alike interface thereby allowing applications to stream messages from multiple clusters/multiple data centers in parallel.

Stream near Real-Time Events

Being one of the primary goals, its evident this use-case is supported as a first class citizen. There are certain nuances in supporting this use-case with primary one being, events arriving at Conduit collectors get flushed to HDFS within sub 10 sec latencies and every minute boundary a file gets cut for a stream and is made available for batch consumption at an immutable location.

A laggard/slow consumer consuming events could get stuck due to reasons unknown to Pintail and in the interim the file location from where pintail was streaming real time events gets moved for batch consumption to an immutable location. Pintail in that case will identify the same and jump its head to the batch location to continue streaming events and eventually based on the consumer speed might jump its head back to the near real-time collector location at a later time.

For low latency event delivery PinTail reads the HDFS files concurrently while the collector is writing them. It keeps track of the last flushed location keeps its head moving till that file is closed and then moves to the newly created one.

A consumer has a lifecycle of consume -> process for every event or batch of events. Since the process time is application specific its imperative that applications which wait for other events/signals/CPU intensive tasks, might not able to be able stream and process events at the event generation rate. Pintail supports consumers for a single stream wherein all consume mutually exclusive events for. An application with a higher process cycle or a stream having high volume can partition the stream across multiple consumers and scale the consumers independently.

Checkpoint a stream

A stream can be check pointed at regular intervals or prior to graceful shutdown of an application and application can start reading messages again once up from the last checkpoint.

Custom Input Format

There isnt any binding on the input format of the file from where pintail will stream messages. A custom input format like TextInputFormat, KeyValueInputFormat, RCFileInputFormat, etc. can be specified and pintail will accordingly stream messages using the specified Record Reader all with a configuration. It is decoupled from Conduit and can be used as a general purpose-streaming library as long as files are published in minute directories following a pattern [{prefix}/YYYY/MM/DD/HH/MM/], which is standard across the Hadoop eco-system.

Usage

We have been running Pintail in production at InMobi for almost a year now for multiple streams. It has been able to successfully address the existing gap of streaming consumers from Hadoop at InMobi.

Acknowledgements

I would like to thank Amareshwari Sriramdasuand other members of data platform and Adserving teams at InMobi for making this project a success. I would also like to acknowledge Facebook Ptail and Apache Kafka consumer library that served as a source of inspiration while building Pintail especially in the areas of checkpoint and consumer groups.

About the Author

Inder Singh, works as a developer in the Data Platform team at InMobi which takes care of All Things Hadoop @ InMobi.



Back to Top