InMobi has been using big-data technologies (Apache Hadoop and its family) for well over 2.5 years to store and analyze large volumes of serving events and machine-generated logs. InMobi receives in excess of 10 billion events (ad-serving and related) each day through multiple sources/streams originating from over ten geographically distributed data centers.

Background

As we have come to realize, just using the right serialization/storage format or processing technology is not adequate for managing a complex big-data processing environment such as ours. The issues discussed below are common across most of our processing pipelines and data management functions, and we needed to take a platform approach to solving them.
Introducing Apache Falcon (formerly known as Ivory)

To start with, we had a single central data center where all the processing happened, and we were looking to explore cheaper and more effective ways of processing this data. At the time, we used a simple in-house scheduler to manage job flows in our environment. We realized that to process data in a decentralized fashion, we needed to push the complexity into a platform and let engineers focus on the processing/business logic. Besides data processing, all other data management functions also became decentralized or duplicated in such a setup. So we invested time in building a data-center-location-aware processing and data management platform: Apache Falcon. The Apache Falcon platform allows users to simply define infrastructure endpoints (Hadoop cluster, HBase, HCatalog, database, etc.), logical tables/feeds/datasets (location, permissions, source, retention limits, replication targets, etc.), and processing rules (inputs, outputs, schedule, business logic, etc.) as configurations. These configurations are expressed in such a way that the dependencies between the configured entities are explicitly described. This information about the interdependencies between entities allows Falcon to orchestrate and manage various data lifecycle management functions.
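As a rough illustration, a feed (dataset) entity might be declared along the following lines. The feed name, cluster names, and paths here are hypothetical, and the exact schema elements can vary across Falcon versions; treat this as a sketch of the idea rather than a verbatim definition.

```xml
<!-- Sketch of a Falcon feed (dataset) definition.
     Names, paths, and values are hypothetical. -->
<feed name="ad-serving-events" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <!-- Source cluster: where the feed is produced. -->
    <cluster name="primary-colo" type="source">
      <validity start="2013-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="days(90)" action="delete"/>
    </cluster>
    <!-- Target cluster: Falcon schedules replication to this colo. -->
    <cluster name="backup-colo" type="target">
      <validity start="2013-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <retention limit="months(12)" action="delete"/>
    </cluster>
  </clusters>

  <locations>
    <!-- Partitioned HDFS path; Falcon substitutes the time variables. -->
    <location type="data" path="/data/events/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>

  <ACL owner="etl" group="data" permission="0755"/>
</feed>
```

The source/target split in the cluster list is what lets Falcon drive both retention and cross-colo replication from a single declarative definition.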
At the core of Falcon are the dependency graph between the participating entities and the entity definitions themselves. Falcon stores these entity definitions and their relationships in a configuration store. When a user schedules a process or a dataset (replication, archival, retention tasks, etc.) in Falcon, the system uses the stored configuration details to construct a workflow definition and schedules it on the underlying workflow scheduler (Apache Oozie). In addition, Falcon injects into the generated workflow a task/node that notifies Falcon itself when the workflow succeeds or fails. This control message is then used by Falcon to take further actions, such as retrying on failure or handling changes in inputs.
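A process entity ties feeds together with the business logic. The sketch below, again with hypothetical names and paths, shows how inputs, outputs, the user workflow, and a retry policy might be declared; Falcon turns a definition like this into a scheduled Oozie workflow with its own notification step injected.

```xml
<!-- Sketch of a Falcon process definition; names and paths are
     hypothetical. -->
<process name="hourly-aggregation" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primary-colo">
      <validity start="2013-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>

  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>

  <!-- References to feed entities: Falcon derives its dependency
       graph from these. -->
  <inputs>
    <input name="events" feed="ad-serving-events" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="aggregates" feed="hourly-aggregates" instance="now(0,0)"/>
  </outputs>

  <!-- The user's business logic, handed to Oozie for orchestration. -->
  <workflow engine="oozie" path="/apps/aggregation/workflow.xml"/>

  <!-- Falcon's self-notification drives retry handling like this. -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```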
In its current form, Apache Falcon uses Apache Oozie for orchestrating "user" workflows, while pre-defined workflow templates are used for replication, retention, and other data management actions. The choice of Apache Oozie was mainly driven by the data availability feature that Oozie comes with. Apache Falcon provides notifications about the availability of each dataset it produces or removes, whether through replication across clusters, generation of new data by a process/pipeline, or retention. Falcon integrates with standard JMS-based notifications for data availability, and users of the Falcon system can subscribe to these notification events to trigger downstream processing once data becomes available. In colo-aware mode, Falcon bundles a component called Prism, which is simply a proxy over the individual Falcon instances in each of the data centers. This allows operators and engineers to access Falcon globally from a single location rather than treating it as a distributed, decentralized installation, significantly reducing the operational overhead of managing decentralized processing.
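For example, a downstream consumer could subscribe to these events with a plain JMS client. The broker URL and topic name below are hypothetical placeholders (the actual values depend on how Falcon's message broker is configured), but the sketch shows the general shape of such a subscriber:

```java
// Sketch of a downstream JMS subscriber for Falcon's data-availability
// messages. Broker URL and topic name are hypothetical.
import javax.jms.*;
import org.apache.activemq.ActiveMQConnectionFactory;

public class FeedAvailabilityListener {
    public static void main(String[] args)
            throws JMSException, InterruptedException {
        ConnectionFactory factory =
            new ActiveMQConnectionFactory("tcp://falcon-broker:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session =
            connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        // Subscribe to the topic on which feed events are published.
        Topic topic = session.createTopic("FALCON.ad-serving-events");
        MessageConsumer consumer = session.createConsumer(topic);

        // Kick off downstream processing as each new instance arrives.
        consumer.setMessageListener(message ->
            System.out.println("Feed instance available: " + message));

        Thread.currentThread().join(); // keep the subscriber alive
    }
}
```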
Apache Falcon has been deployed in production at InMobi for nearly a year and is widely used for various processing pipelines and data management functions, including SLA-critical feedback pipelines and correctness-critical revenue pipelines, besides reporting and other applications. We will cover some of these applications in detail in a subsequent blog post. The project is now incubating at the Apache Incubator.