InMobi has been using big-data technologies (Apache Hadoop and its family) for well over 2.5 years for storing and analyzing large volumes of serving events / machine generated logs. InMobi receives in excess of 10 billion events (ad-serving and related) each day through multiple sources/streams originating from over ten geographically distributed data centers. Background As we have come to realize, just using the right serialization / storage format or processing technology is not adequate in managing complex big-data processing environment such as ours. Issues discussed below are common across most of our processing pipelines or data management functions and it was required to take a platform approach towards solving for these.
- In a geographically distributed setup, assuming WAN networks to be highly available is neither practical nor cost-effective. In such an environment, the flow of data across data centers might be stalled or delayed. This might come in the way of making critical feedback generation / learning processes or business intelligence reporting data available on time.
- There might be different types of data processing requirements. The ones where the timeliness of the pipelines is more critical than the correctness (ex. Feedback pipelines), while in case of others correctness and completeness are more important (BI / Reporting). Particularly for the second class of applications, if the input data were to change due to delays, it should automatically reprocess them. Now this needs to automatically happen for all downstream consumers of the input data whether direct or indirect.
- As summarized data is being generated through various processing technologies such as Apache Pig/Apache Hive/streaming/map-reduce etc, the same needs to be replicated onto other data center(s) for disaster recovery (DR) and business continuity purposes on a continuous basis.
- Be it for a operability requirement to nail down potential SLA breach due to an upstream issue or identifying impact of a schema change of a particular event stream, lineage of one data set with another can be very handy. This will let production engineering team holistically assess and react to issues.
- As new data constantly flows into the hadoop cluster, it is important to archive data onto a low cost storage to comply with legal requirements, while purging them from the cluster. It is important to consider the overall retention period of a particular data stream/summary while archiving the data, replicating the data for DR.
- While it is simpler to ship all events being recorded across various data centers to one central location for processing, it might actually be both cheaper and faster to process them locally and ship minimal summaries to a central location. This would mean that the processing has to repeated in each of the locations. It would be much simpler if it is possible to declare this intent and have this distributed processing with a global roll-up orchestrated automatically.
Introducing Apache Falcon (formerly known as Ivory) To start with, we had a single central data center where all the processing happened and we're looking to explore cheaper and more effective ways of processing this data. We used a simple in-house scheduler to manage job flows in our environment then. We realized that to be able to process data in a decentralized fashion, we need to have the complexity pushed into a platform and allow the engineers to focus on the processing / business logic. Besides data processing needs, all other data management functions also became de-centralized or repeated in such a setup. So we invested time to build a data center location aware processing and data management platform, Apache Falcon. Apache Falcon platform allows users to simply define infrastructure endpoints (Hadoop cluster, HBase, HCatalog, Database etc), logical tables/feed/datasets Â(location, permissions, source, retention limits, replication targets etc) and processing rules (inputs, outputs, schedule, business logic etc) as configurations. These configuration are expressed in such a way that the dependencies between these configured entities are explicitly described. This information about interÂdependencies between various entities allow Falcon to orchestrate and manage various data lifecycle management functions.
At the core of Falcon is the dependency graph between various participating entities and the entity definitions themselves. Falcon stores these entity definition and the relationship in a configuration store. Upon user scheduling a process or a data-set (replication, archival, retention tasks etc) in Falcon, the system, uses the stored configuration details to construct a workflow definition and schedule on the underlying workflow scheduler (Apache Oozie). However Falcon system, injects within the actual workflow a task / node to notify itself when the workflow itself is successful or failed. This control message is then used by Falcon to take further actions such as retries on failures, handling changes in inputs etc.
In its current form Apache Falcon uses Apache Oozie for orchestrating â€œuserâ€ workflows, while pre-defined workflow templates are used for replication, retention and other data management actions. The choice of Apache Oozie was mainly driven by the daily availability feature that Oozie comes with. Apache Falcon provides notification regarding availability of each dataset produced through it or removed through it (be it through replication across clusters, generation of new data through a process/pipeline or retention). Falcon is integrated with standard JMS based notification for data availability. Users of the Falcon system can subscribe to these notification events for any downstream processing once data becomes available. In a colo aware mode, Falcon bundles with it a component called Prism, which simply is a proxy over the individual Falcon instances in each of the data centers. This allows operators and engineers to access Falcon globally from a single location and not look at it as a distributed and decentralized installation. This helps significantly in reducing the operability overheads in managing decentralized processing.
Apache Falcon has been deployed in production within InMobi for nearly a year and is being widely used for various processing pipelines and data management functions, including SLA critical feedback pipelines, correctness-critical revenue pipelines besides reporting and other applications. We will cover details about some of these applications in a subsequent blog. This project is now incubated through Apache Incubator.