This post announces the public unveiling of leveldb based backend for graphite.
A natural outcome of liking any tool is that its usage increases to the point wherein one pushed the limits so much that it seems like an overuse. For InMobi, this started out with increasing the number of metrics that each of our core ad-servers would publish. This number was a little shy of hitting the one thousand mark.
The other thing that we started testing was to capture some of the business related metrics using the same path. This is a rather unorthodox way of doing things in the age of big data. What was however appealing was the simplicity of the workflow; instrument the code in the right places with the yammer metrics library and then start seeing aggregated data right after the next release. This is where the “abuse” of graphite began. We started to pump close to a million data points per minute on a single carbon server and it would just trip over with no signs of recovery.
Things start to go foul
While it was obvious that the evidence of being able to push the carbon daemon to such an extent is almost unheard of, it seemed doable based on first principles given our server’s configuration
. The counterargument of it being ridiculous also seemed to be valid given carbon’s default configuration:
MAX_UPDATES_PER_SECOND = 500
MAX_CREATES_PER_MINUTE = 50
Whatever may be the historical reason, this seems woefully under calibrated. The first port of call was to bump up these numbers to observe the effect. It turns out that these numbers aren’t way off the reasonable limits. A little bit of hasty digging around revealed that there is a new experimental storage engine known as ceres that could be used in lieu of the standard whisper engine. This however required us to switch to a seemingly abandoned branch of the carbon know as megacarbon. That did not deter us and we did go ahead. The good part of switching to ceres is that the create time for new metrics did drop. However, it brought in its own set of problems.
Both whisper & ceres have a basic design induced limitation on the number of metrics that could be handled by a single server. Whisper would create a file for every metric and ceres creates a directory with multiple files for a given metric. A baffling problem that this results to begin with is that we ran out of disk space even when a dedicated 1TB drive. It turned out that we were exhausting the metadata space and not the actual space on the filesystem. Most linux file systems have a fixed number of inodes allocated for each partition. This is preset at the time of formatting and cannot be increased; however a value larger than the default can be chosen. A quick way to check this would be to use the good old “df” command but with the “-i” option. This shows inode usage as opposed to actual storage space usage. So we would run out of close to 2 million inodes when consuming less than a GB of data space. This was exceptionally worrying until we troubleshooted it for 2 reasons:
Root cause analysis
While we did address the above two concerns thus saving the server when carbon would misbehave, we still had not solved the actual problem. The real issue that struck us when going through this was something that I had encountered in my PHP days when fighting the notorious require_once induced APC slowdown. The carbon code had absolutely no persistent open file handles to the storage layer. So, if you are trying to store even a hundred thousand metrics per minute, it means making the following vfs calls in sequence: stat, open, read, seek, write, close. To make things worse, a lot of metric names tend to be hierarchical with as many as 8 components in its name. That would make the stat() call work even harder. The frequency at which this happens is so high that both the d-entry cache and also the file cache of the OS gets completely thrashed as we go through a single minute of the write cycle.
The experimentation begins
At this point, it was obvious that the acceptable solution had to be one wherein the amount of superfluous filesystem operations per write is reduced drastically. Given that whisper itself traces it origins from RRD which is nothing but a ring buffer, the natural choice was to explore solutions that would let us express this idea natively. Redis pops to mind given that it a great data-structure storage layer but its inability to use persistent storage meant that we could not use for our scale. The next choice was arrays in PostgreSQL. While programming against it turned out to be easy and it did avoid some of the issues that whisper & ceres had, it brought in its own set of evils. PostgreSQL never updates a row in-place even if the datatype is fixed width. This had the horrendous side-effect of copying over an entire row i.e. the entire ring buffer for a given metric as each new data point was inserted. So if you had planned for retention of say a thousand points per metric, then insertion of new point would actually end up taking 1000x the space that you thought would be needed! Of course, the theory goes that VACUUM would eventually release the older copies but that wasn’t happening fast enough but more importantly, the additional write volumes was reducing the disk throughput.
Despite all of this, it turned out to be a few multiples better than either whisper or ceres
The lesson learnt from both the PostgreSQL and the whisper/ceres battles was that a good solution required a good disk programming model and not a good memory programming model. While the write logic used by PostgreSQL did seem wasteful, the trigger from the experience (pun unintended) was to lean towards sequential I/O for the write path.
Whisper’s storage involved storing just the time and value and the metric name itself would be recorded just once. Ceres cut this down even further by storing the time just once per slice. A mixed metric, append only solution requires the verbosity of mentioning all 3 elements for each point that is being recorded but has the huge advantage of not making the disk arm swing wildly. This approach does compromise the read performance relatively speaking; but it turns out to be a non-issue as most of the metrics are “second level metrics” i.e. something that one drills down into only when a smaller set of primary metrics that happen to be observed at all times are out of place.
This led us to leveldb; one of the more popular embedded databases when it comes to raw speed. The basic implementation is straightforward: encode each metric name to fixed size internal short code, append with the timestamp for which the metric needs to be recorded and use this as the key. The value is trivially the metric reading. The additional component that was needed to support metrics exploration both on the graphite UI and also for wildcard support was to have a filesystem listing and globbing emulator around the metrics’ names themselves. This used to be straightforward in the case of whisper & ceres as their layout on the filesystem actually mirrored this convention.
Lastly, we had to build a jsonrpc reader bridge into the storage engine and expose that through the carbon server. This became necessary as leveldb does not permit a given database being opened from multiple processes at the same time; hence making it impossible for the graphite webapp to read the data. Initially, the graphite webapp also had to be patched to fetch the data. This was needed regardless of the leveldb induced complexities as there is no equivalent of the megacarbon branch as yet on the graphite web project and hence adding support for additional storage engine tends to be convoluted to begin with. However, that is no longer an issue as the finders for graphite web is now configurable!
The complete implementation can be found here: level-tsd
The carbon server is now able to run without breaking a sweat even when 500K metrics per minute is being pumped into it. This has been in production since late August 2013 in every datacenter that we operate from.
Enough people have solved the time-series data problem with big-table based solutions and yet it seemed like a non-option. InMobi was an early adopted and contributor to OpenTSDB when it first came out many years ago. Yet, we chose to stay away from it this time around due to a few philosophical reasons. The metrics that are being stored are critical to both the business directly and also to our site reliability teams. Having an unexpected downtime of this system would be crippling. Hence, the reliability of this very system becomes paramount.
We are looking at a scale up solution with active full-replica for HA. As long as the bounds are well within what a single machine can achieve and not pushing it too far, an embedded solution may be favourable. The important thing to note in a embedded storage is that it does not naturally scale out, doesn't provide data redundancy (mostly taken care of at application space), but provides the ability to colocate processing / aggregation along with data. This lets us squeeze a lot more out of a single host.
While this has been a fun ride, we seem to be running into the wall all over again. The bottleneck that we seem to be hitting is the inability of the cpython + twisted to use multiple CPU cores. Spawning multiple instances of the daemon is infeasible given leveldb’s restriction that a given db can be opened only from a single process. The objective now is the throughput limit
One of the ideas afloat is write a stripped down version of the carbon server in a language whose runtime could make use of multiple CPU cores. This an experimental version that has already been developed in go but it is still early days to talk about it. What we commit to however is to open that up as well should it turn out to be a stable & significant improvement.
 2* Intel E5-2640 , 4 * 600G 15K rpm disks in RAID-5 configuration, 4 * 8GB DDR3 RAM 1333 MHz