At InMobi, we have an internal Hadoop-based system to support users' ad-hoc analytics needs. This system gives users a unified conceptual interface to data collected from several sources, without requiring them to know the schema, location, or granularity at which the data is stored. In the background, this system runs Hadoop jobs that can do map-side joins and aggregations, and it also supports custom UDFs.
The system described above currently uses a custom query language which allows users to express selections, projection formulae, joins, and aggregations. Given the recent interest in HQL-based systems within the Big Data community, such as Hive (Stinger), Impala, and Shark, we felt it would be worthwhile to move to HQL so that we could take advantage of community efforts around these systems. Also, since most of these systems can share a common schema through the Hive metastore, it becomes easier to maintain a single schema for all data in the company. However, in its current state, HQL has certain limitations when compared with our system:
1. Currently the Hive metastore does not provide a way to express relationships between tables.
2. Users need to remember the schema, and queries can become fairly complex if they involve joins or span data across tables.
3. Users query data at a relational level rather than at a conceptual level, which is easier for business users to understand.
To solve these problems, we have proposed the Cube abstraction in Hive. An in-progress version of our work has already been proposed to the Hive community [HIVE-4115]. With the cube abstraction, data is organized in data cubes comprising multiple fact and dimension tables. A cube consists of measures and dimensions (which link the cube to dimension tables). A cube provides a logical abstraction for data which is internally stored in multiple physical tables. For example, fact data can be aggregated at different time intervals and kept in separate tables (my_fact_minutely, my_fact_hourly, etc.). Similarly, dimension data, which is usually changed through user adjustments, can come in the form of snapshots that are effective as of a particular date. Users submit queries in Cube HQL (a slightly modified HQL), and internally the cube layer will automatically do the following:
1. Figure out which dimension tables contain data needed to satisfy the query.
2. Figure out the exact fact tables and their partitions based on the time range in the query. While doing this, it will try to minimize computation needed to complete the queries.
3. Automatically add aggregate functions to measures, and create the corresponding GROUP BY clause.
4. Automatically resolve joins using the relationships between facts and dimensions.
5. Rewrite the query into HQL and submit it to Hive for processing.
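To give a flavor of step 2, here is a simplified sketch (in Python, not the actual Hive implementation) of how a queried time range can be covered with the fewest partitions, preferring coarser rollup tables over finer ones. The table names `my_fact_daily` and `my_fact_hourly` and the partition formats are illustrative assumptions; the real logic also handles more granularities and partition availability.

```python
from datetime import datetime, timedelta

def resolve_partitions(start, end):
    """Greedily cover the half-open range [start, end) with the coarsest
    partitions that fit, so the rewritten query scans as little data as
    possible. Hypothetical tables: my_fact_daily (daily rollups) and
    my_fact_hourly (hourly rollups)."""
    parts = []
    cur = start
    while cur < end:
        # Use a daily partition when cur sits on a day boundary
        # and a whole day still fits inside the range.
        if cur.hour == 0 and cur + timedelta(days=1) <= end:
            parts.append(("my_fact_daily", cur.strftime("%Y-%m-%d")))
            cur += timedelta(days=1)
        else:
            # Otherwise fall back to hourly partitions at the edges.
            parts.append(("my_fact_hourly", cur.strftime("%Y-%m-%d-%H")))
            cur += timedelta(hours=1)
    return parts
```

For a range from 22:00 on April 1 to 02:00 on April 3, this picks two hourly partitions at the front, one daily partition for April 2, and two hourly partitions at the tail, rather than 28 hourly scans.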
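Step 4 can be sketched similarly: given a hierarchical dimension (say, city rolling up to state and country), the cube layer can derive the chain of dimension-table joins from declared relationships instead of making the user write them. The hierarchy levels, table names, and key columns below are hypothetical, and the real resolver works from metastore metadata rather than a hard-coded list.

```python
# Hypothetical metadata for a geo hierarchy declared on the cube: each
# level names its dimension table and the key that links it to the level
# below (the fact table carries only the lowest-level key, city_id).
GEO_HIERARCHY = [
    ("city",    "city_dim",    "city_id"),
    ("state",   "state_dim",   "state_id"),
    ("country", "country_dim", "country_id"),
]

def joins_for(attribute):
    """Return the chain of JOIN clauses needed to reach `attribute`,
    walking up the hierarchy from the fact table."""
    chain = []
    prev = "fact"
    for level, table, key in GEO_HIERARCHY:
        chain.append(f"JOIN {table} ON {prev}.{key} = {table}.{key}")
        prev = table
        if level == attribute:
            return chain
    raise KeyError(f"unknown hierarchy attribute: {attribute}")
```

Asking for `state`, for instance, yields the fact-to-city join followed by the city-to-state join, which the rewriter can splice into the generated HQL.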
In addition to this, the cube layer has some more goodies, like support for hierarchical dimensions and support for multiple storages across multiple Hadoop clusters. At this stage, we have already proposed this feature to the Hive community, and it's currently being evaluated at InMobi. We will cover the design, implementation details, and an example walkthrough in a series of blog posts, so stay tuned!