Tuning garbage collector settings is a favourite pastime of many site reliability engineering teams, as sub-optimal settings result in awkward response time behaviours. The first rule of tuning anything is to understand the existing behaviour of the system before tinkering with it. Being data driven about performance tuning also requires us to demonstrate correlation between the dependent variable (some performance-related metric) and an independent variable (the value we are trying to tune). We shall now look at an example from one of our systems to uncover the complexities of undertaking such an activity. The system in consideration here is the core ad-serving component of InMobi's business. The application in question is a standard Java webapp running on dedicated bare metal servers. We prefer this hardware setup so that host-level resource contention is minimal, giving us good control over server response times. The details of the JVM and its settings on the specific host whose metrics are discussed in the remainder of this post are given below.
~ $ java -version
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)

JAVA_OPTS="-Djava.awt.headless=true -Xmx9216M -XX:-OmitStackTraceInFastThrow -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly -XX:+UseConcMarkSweepGC -XX:NewRatio=4 -XX:SurvivorRatio=5"
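For readers who want to collect the same kind of per-collection data, HotSpot on this Java 6 build can write a GC log. The flags below are illustrative additions for observation, not part of our production JAVA_OPTS above, and the log path is an assumption:

```
-verbose:gc
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-Xloggc:/var/log/app/gc.log
```

The timestamped log lines are what allow GC activity to be lined up against application latency metrics on a shared time axis.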
Years of experience, coupled with a fundamental understanding of the nature of the application, have shown us that the "weak generational hypothesis" is very strongly pronounced in our situation. So our objective has always been to ensure that objects that are supposed to die young actually do so, i.e. they do not falsely get promoted to the older generation. This implies that young generation garbage collection gets really busy, and we need to watch out for its possible interference with the execution of the main application code. Hence we shall look at the TP99 response time metric in conjunction with the activity levels of the GC. This graph shows 3 metrics of interest: the number of collections in the young generation (blue line), the size of the young generation (grey line) and the TP99 response times (green line). We have also shown the server load, expressed in requests/second (red line), to allay concerns about the traffic pattern being an influencer. It can be seen that all 4 values were roughly in steady state during the first 5 hours (14:00-19:00) of operation. The GC started working harder for the next 4 hours to maintain the same size of the young generation. One can presumably attribute this to a slight change in the nature of the workload: we do business across the planet, and different regional markets trigger different data access patterns. The interesting thing to note is that the overall system performance, i.e. the latency numbers, was not affected despite the increased pressure on the young generation collector. However, there was a point (around 23:00) beyond which the increase in workload for the GC started taking a toll on the response times. Things got progressively worse for the next 150 minutes, until the JVM decided the degradation was unacceptable and took a drastic step: a major resize of the young generation.
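The young-generation counts and sizes plotted above can also be sampled from inside the JVM through the standard management beans. A minimal sketch follows; note that collector and pool names vary with the GC configuration (under CMS the young collector typically reports as "ParNew" or "Copy"), and the class name GcStats is our own:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class GcStats {
    public static void main(String[] args) {
        // Per-collector collection counts and cumulative pause time (ms).
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        // Current usage of each heap pool (eden, survivor, old generation).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            System.out.println(pool.getName()
                    + " used=" + pool.getUsage().getUsed()
                    + " max=" + pool.getUsage().getMax());
        }
    }
}
```

Polling these beans once a minute and shipping the values to a time-series store is enough to reproduce the blue and grey lines in the graph.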
The JVM's assessment was that the compute cost (and the side effect of pauses) was linked to the number of collection runs and not to the volume of short-lived objects being generated. It turned out to be cheaper to keep the garbage in a larger young generation and collect it infrequently, i.e. a classical space-time tradeoff. The red line finally finds utility, as we can see it collapsing when the automatic reconfiguration occurred. This does not represent a drop in incoming traffic but shows the JVM stalling and hence its inability to handle requests during that phase. While this seems like a random occurrence in the wild, the underlying aspects are very profound. We list them in no particular order here:

1. The JVM looked at 12 hours of profiling data before deciding to move to a completely different ergonomic setting. Consequently, one can say that understanding the performance characteristics of long running systems simply cannot be done by running a 15 minute load test.

2. Reconfigurations of this nature can also cause long JVM pauses, in addition to the other well known sources of such pauses.

3. It is important to have a separate system (nginx in our case) that is not encumbered with any application responsibilities, to distinguish between a real drop in the incoming request rate and the application choking and losing its ability to handle requests.

While contributing to existing open source projects and releasing new ones is considered cool, we also understand that an important part of making the OSS movement succeed is to actually use other people's good work and increase its user base. This is precisely what we did in this case. The technology used to record and plot these metrics is readily available off the shelf. We used the Yammer metrics library to instrument the application. Specifically, the default web-application filter that ships as part of the library was added to the webapp configuration.
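Wiring that filter in is a one-time change to the webapp's deployment descriptor. A sketch along these lines (the filter class is from the 2.x metrics-web module; the filter name and URL pattern here are our own choices):

```xml
<filter>
    <filter-name>instrumentedFilter</filter-name>
    <filter-class>com.yammer.metrics.web.DefaultWebappMetricsFilter</filter-class>
</filter>
<filter-mapping>
    <filter-name>instrumentedFilter</filter-name>
    <url-pattern>/*</url-pattern>
</filter-mapping>
```

With this in place, the library records request rates and response time percentiles (including the TP99 shown above) for all requests passing through the webapp, with no changes to application code.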
The data was transmitted to a standard Graphite cluster at a 1-minute frequency using the connector that ships as part of the same metrics library. Stay tuned for more posts on real world observations in the field of application performance and site reliability engineering on our Reflections blog.