Over the past week I’ve been developing a presentation and demonstration for training people on performance monitoring with the Webmetrics GlobalWatch platform. As part of that training I identified 5 key graphs that prove invaluable when analyzing performance data. To present these graphs I developed a slide that lets my audience draw the 5 graphs by hand and list the value that each one provides. Here is a copy of the slide filled out by me:
To clarify, these graphs are specific to a monitoring service that monitors a multi-step transaction (for example, a purchase on Amazon.com, a user log-in, or a user registration). However, similar graphs should be available for other types of monitoring (stream performance, web service monitoring, URL monitoring, DNS monitoring). Basically, I recommend that whenever you look to implement performance monitoring you make sure that its presentation layer allows you, at a minimum, to view your data in ways similar to these 5 graphs:
- Transaction Average Load Time – A graph that shows you the high level view of the data you are collecting, in this case the average load time throughout the day of performing the transaction (or viewing a single URL). It also acts as an executive summary that shows basic trends in the performance of the transaction over time. Optionally, it is useful to display any errors that occurred in the graph. It also helps if you can easily drill down on the graph so that it does not lock you into only a high level view (the drill-down lets you see the individual sample values that make up the displayed average). Again, the primary benefit of this graph is a high level view that allows you to look for trends in performance.
- Transaction Step Averages – Another view of average data, but this time we’re drilling down to the individual steps that make up the transaction. The example drawing shows a 6-step transaction with errors on steps 1 and 4 (and a performance bottleneck at step 4 as well). The benefit of this graph is that you can now break down the performance of the steps that make up the transaction being monitored. However, it’s still an average, so it gives us a high level view: it breaks a complex set of data into steps and lets us identify which steps could use performance improvements. While errors on a per-step basis should be an option for this graph, drill-down capabilities would probably be overkill, and the presentation would be clunky at best.
- Transaction Steps Over Time – A graph that shows the average performance over time for each step in the transaction relative to the other steps. This graph is similar to the first graph discussed, but it breaks down the data so that we can look at trends for each individual step (and see how performance degradation affects individual steps in the transaction, as opposed to the transaction as a whole). Errors should again be an optional parameter to the graph, but they should be distinguished by the step on which they occurred, since the primary data plotted is per-step performance. This graph only adds value for a service that monitors multiple steps (either a transaction or a number of URLs).
- Full-Page Breakdown – Arguably the most important graph that external performance monitoring can generate. If you’re looking at a monitoring solution that doesn’t provide it (or…gasp: doesn’t even collect full-page data) then you are not getting the true value of external performance monitoring. The full-page breakdown is a waterfall style graph that shows the download/rendering characteristics of a web page (or pages). It displays performance data for every item that makes up a web page (images, CSS, JavaScript files, etc.). Whenever you record the performance of these items you should consider at least the basic performance metrics (HTML download, redirection, network latency, and transfer time). The full-page breakdown is a great reflection of browser fidelity (that is, how accurately a monitoring solution emulates an actual client such as a browser). This is generally the lowest level of granularity you can get with external monitoring, and it allows you to see what impact individual components (JavaScript, images, etc.) have on the overall performance of your web applications and pages.
- Uptime & Average Load Time – This graph is central to external monitoring solutions (and would only exist on massively deployed internal solutions). Its focus is on providing performance metrics on a per-location basis. Since external monitoring is done from global locations outside your firewall, you will see different performance for different regions (samples originating further away from your servers will take longer to traverse the Internet). Monitoring solutions that are deployed in house suffer from the proximity between the resource being monitored and the tool that is monitoring it. This graph shows you the performance from different locations (the line drawn across the graph) as well as the uptime from each location (the bars in the background of the graph). A common use of this graph is to evaluate the benefits of a CDN (content delivery/distribution network: a network of servers that push content to the edges of networks around the globe so that requests for the content don’t have to travel far, thus reducing latency). If a site is not using a CDN, you would expect to see a rightward trend in performance improvement (that is, the line representing performance would descend to the right for locations that are closer to the server being monitored). When a CDN is used, you would expect to see a very consistent performance line, because accessing a site or resources through a CDN should reduce overall load time no matter where the client is accessing them from.
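To make the difference between these views concrete, here is a minimal Python sketch of the aggregations behind the first, second, and fifth graphs. Everything in it is an assumption for illustration: the `StepSample` record, the made-up data, and the function names are hypothetical and have nothing to do with how GlobalWatch stores or exposes its data. It also assumes that failed runs still count toward the averages, which a real tool may or may not do.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple


class StepSample(NamedTuple):
    """Hypothetical record: one step measured during one transaction run."""
    run_id: int      # one run = one full pass through the transaction
    hour: int        # hour of day the run started
    location: str    # where the monitoring agent sits
    step: int        # step number within the transaction
    seconds: float   # load time measured for this step
    ok: bool         # False if the step returned an error


# Made-up data: four runs of a two-step transaction.
SAMPLES = [
    StepSample(1, 9, "New York", 1, 1.0, True),
    StepSample(1, 9, "New York", 2, 0.5, True),
    StepSample(2, 9, "London", 1, 2.0, True),
    StepSample(2, 9, "London", 2, 1.5, False),
    StepSample(3, 10, "New York", 1, 1.0, True),
    StepSample(3, 10, "New York", 2, 1.5, True),
    StepSample(4, 10, "London", 1, 2.0, True),
    StepSample(4, 10, "London", 2, 1.0, True),
]


def run_totals(samples):
    """Collapse step samples into one (hour, location, total seconds, ok) per run."""
    runs = {}
    for s in samples:
        hour, location, total, ok = runs.get(s.run_id, (s.hour, s.location, 0.0, True))
        runs[s.run_id] = (hour, location, total + s.seconds, ok and s.ok)
    return list(runs.values())


def transaction_average_by_hour(samples):
    """Graph 1: average full-transaction load time per hour."""
    by_hour = defaultdict(list)
    for hour, _, total, _ in run_totals(samples):
        by_hour[hour].append(total)
    return {hour: mean(totals) for hour, totals in by_hour.items()}


def step_averages(samples):
    """Graph 2: average load time per step across all runs."""
    by_step = defaultdict(list)
    for s in samples:
        by_step[s.step].append(s.seconds)
    return {step: mean(values) for step, values in by_step.items()}


def uptime_and_average_by_location(samples):
    """Graph 5: (uptime percent, average transaction load time) per location."""
    by_location = defaultdict(list)
    for _, location, total, ok in run_totals(samples):
        by_location[location].append((total, ok))
    return {
        location: (
            100.0 * sum(ok for _, ok in runs) / len(runs),
            mean(total for total, _ in runs),
        )
        for location, runs in by_location.items()
    }


if __name__ == "__main__":
    print(transaction_average_by_hour(SAMPLES))
    print(step_averages(SAMPLES))
    print(uptime_and_average_by_location(SAMPLES))
```

Note that the same eight samples feed all three views; only the grouping key changes (hour, step, or location). That is the practical reason to insist that a monitoring tool keeps the individual samples around rather than only pre-computed averages.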
One important note on the last graph. I mentioned that the Uptime & Average Load Time graph is good for evaluating CDNs. This is true if the external monitoring solution is an impartial third party. Some monitoring services may be partnered with CDNs such that the CDN refers customers to the monitoring company in exchange for performance metrics that are skewed to show better results than what is actually achieved. There isn’t anything tricky going on here; it simply has to do with where the monitoring agent is located relative to the CDN server that is providing content. If they’re in the same data center, then the performance data collected is somewhat biased and will show better performance improvements than most people will experience when using the CDN. Definitely find out the context behind any data that you’re using to evaluate CDNs or any other web technology that promises to improve your performance.
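If you want a rough number to go with the "consistent line versus descending line" comparison from the Uptime & Average Load Time graph, one option is to measure how far apart the per-location averages are. The sketch below is only an illustration: the `relative_spread` helper, the location names, and the load times are all made up, and the metric itself is my own simple choice rather than anything a monitoring product reports.

```python
from statistics import mean


def relative_spread(averages_by_location):
    """(slowest - fastest) / mean of the per-location average load times.

    A value near 0 means the line across locations is flat; larger values
    mean performance depends heavily on where the request comes from.
    """
    values = list(averages_by_location.values())
    return (max(values) - min(values)) / mean(values)


# Made-up per-location average load times in seconds.
without_cdn = {"Sydney": 4.0, "London": 3.0, "New York": 2.0, "Chicago": 1.0}
with_cdn = {"Sydney": 1.25, "London": 1.0, "New York": 1.0, "Chicago": 0.75}

if __name__ == "__main__":
    print(relative_spread(without_cdn))
    print(relative_spread(with_cdn))
```

A flatter line (smaller spread) is what you would hope to see with a CDN, but the caveat above still applies: if the monitoring agents sit in the same data centers as the CDN's servers, the spread will look better than what real users experience.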
