Over the past week I’ve been developing a presentation/demonstration for training people on performance monitoring with the Webmetrics GlobalWatch platform. As part of that training I identified 5 key graphs that have proven invaluable when analyzing performance data. To present these graphs I developed a slide that lets my audience draw the 5 graphs by hand and list the value that each graph provides. So here is a copy of the slide filled out by me:
5 Key Graphs
To clarify, these graphs are specific to a monitoring service that monitors a multi-step transaction (ex: purchase on Amazon.com, user log-in, or user registration); however, similar graphs should be available for other types of monitoring (streaming performance, web service monitoring, URL monitoring, DNS monitoring). Basically, I recommend that whenever you look to implement performance monitoring you make sure the presentation layer of that monitoring allows you (at least) to view your data in ways similar to these 5 graphs:
- Transaction Average Load Time – A graph that shows you the high-level view of the data you are collecting, in this case a view of the average load time of performing the transaction (or viewing a single URL) throughout the day. This also acts as an executive summary that helps show basic trends in the performance of the transaction over time. Optionally, it is valuable to display any errors that occurred in the graph. It also helps if the graph can easily be drilled down on so that it does not lock you into only a high-level view (the drill-down allows you to see the individual sample values that make up the displayed average). Again, the primary benefit of this graph is a high-level view that allows you to look for trends in performance.
- Transaction Step Averages – Another view of average data, but this time we’re drilling down to the individual steps that make up the transaction. The example drawing shows a 6-step transaction with errors on steps 1 and 4 (and a performance bottleneck at step 4 as well). The benefit of this graph is that you can now break down the performance of the steps that make up the transaction being monitored. However, it’s still an average, so it gives us a high-level view that identifies which steps in a transaction could use performance improvements while breaking down a complex set of data. While errors on a per-step basis should be an option for the graph, drill-down capabilities would probably be overkill and the presentation would be clunky at best.
- Transaction Steps Over Time – A graph that shows the average performance over time for each step in the transaction relative to the other steps. This graph is similar to the first graph discussed, but it breaks down the data so that we can look at trending for each individual step (as well as see how performance degradation affects individual steps in the transaction – as opposed to the transaction as a whole). Errors should again be an optional parameter to the graph, but they should be distinguished by the step they occurred on since the primary data plotted is per-step performance. This graph only adds value for a service that monitors multiple steps (either a transaction or a number of URLs).
- Uptime & Average Load Time - This graph is central to external monitoring solutions (and would only exist on massively deployed internal solutions). The focus of this graph is on providing performance metrics on a per location basis. Since external monitoring is done from global locations outside your firewall you will see different performance for different regions (samples originating further away from your servers will take longer to traverse the Internet). Monitoring solutions that are deployed in house suffer from proximity between the resource being monitored and the tool that is monitoring…this graph will show you what is the performance from different locations (the line drawn across the graph) as well as the uptime from each location (the bars in the background of the graph). A common usage of this graph is to evaluate the benefits of a CDN. If a site is not using a CDN you would expect to see a rightward trend in performance improvement (that is the line representing performance would descend to the right for locations that are closer to the server being monitored). When a CDN is used you would expect to see a very consistent performance line because accessing a site/resources provided by a CDN will reduce overall performance no matter where the client is accessing the site/resource from (CDN = content delivery/distribution network. This is a network of servers that push content to the edges of the networks around the globe so that requests for the content don’t have to travel far…thus reducing latency times).
One important note on the last graph. I mentioned that the Uptime & Average Load Time graph is good for evaluating CDNs; this is true if the external monitoring solution is an impartial third party. Some monitoring services may be partnered with CDNs such that the CDN refers customers to the monitoring company in exchange for performance metrics that are skewed to show better results than what are actually achieved. There isn’t anything tricky going on here…it simply has to do with where the monitoring agent is located in comparison to the CDN server that is providing content. If they’re in the same data center then the performance data collected is somewhat biased and will show better performance improvements than most people will experience in using the CDN. Definitely check and find out what the context is behind the data you’re using to evaluate CDNs or any other web technology that promises to improve your performance.
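To make the aggregation behind these graphs concrete, here is a minimal Python sketch, assuming a hypothetical list of raw monitoring samples (the field names and data are illustrative only, not the GlobalWatch API), that computes the hourly average load time from the first graph and the per-location average load time and uptime from the last graph:

```python
from collections import defaultdict

# Hypothetical raw samples; a real monitoring service would supply these.
# Each sample: (hour_of_day, location, load_time_seconds, succeeded)
samples = [
    (0, "New York", 2.1, True),
    (0, "London", 3.4, True),
    (1, "New York", 2.3, False),   # an error counts against uptime
    (1, "London", 3.1, True),
]

# Graph 1: average transaction load time per hour (the high-level trend view).
by_hour = defaultdict(list)
for hour, _loc, load_time, ok in samples:
    if ok:
        by_hour[hour].append(load_time)
hourly_avg = {hour: sum(times) / len(times) for hour, times in by_hour.items()}
print("Hourly averages:", hourly_avg)

# Last graph: per-location average load time (the line) and uptime (the bars).
by_location = defaultdict(list)
for _hour, loc, load_time, ok in samples:
    by_location[loc].append((load_time, ok))
for loc, recs in by_location.items():
    ok_times = [t for t, ok in recs if ok]
    avg = sum(ok_times) / len(ok_times) if ok_times else float("nan")
    uptime = 100.0 * sum(1 for _, ok in recs if ok) / len(recs)
    print(f"{loc}: avg load {avg:.2f}s, uptime {uptime:.1f}%")
```

A presentation layer that offers drill-down is essentially exposing the raw `samples` list behind each averaged point.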
In the world of Load Testing there are three potential perspectives you can test from: Internal, External, and Last Mile. Which perspective you choose to test from really depends on your goals (just as it does with performance monitoring). Here is a breakdown of the three perspectives and the PROS and CONS of each.
Internal – This type of load test is performed from inside the network that is being tested. It provides the best flexibility (because you manage the whole process internally) but is also the most complex and involved. Here are the PROS and CONS:
- PRO: You have total control over the process of testing (you write the scripts, you setup the infrastructure).
- PRO: Most likely you will be able to generate the load you need without having that level of load disrupted (though you’re not 100% safe from disruptions in the load you generate).
- CON: This is normally a very expensive approach since you have to pay for licenses, maintenance, infrastructure, training, support, etc.
- CON: Complexity in deploying/managing infrastructure.
- CON: Doesn’t stress bandwidth or immediate ISP issues.
External – This type of load test is performed from outside the network that is being tested. A slight reduction in flexibility (if you’re not deploying the infrastructure yourself) but an increase in scope of the test (you can test more infrastructure) as well as a decrease in setup time. Here are the PROS and CONS:
- PRO: Emulates traffic as your network (and ISP) will see it.
- PRO: You test under known conditions and in a consistent fashion.
- PRO: Amount of load to apply during the test is essentially unbounded.
- CON: For cost-effective tests the load is generally an emulation of real-world users (rather than traffic from actual browsers).
Last Mile – This is similar to external load testing but it is a bit more representative of the end users that will actually use the site. There is a bit more reduction in flexibility (and security) but you gain a bit more perspective into what users will experience when they access your site under load. Here are the PROS and CONS:
- PRO: Emulates traffic as your network (and ISP) will see it.
- CON: You will generally collect information that will not be of help in making decisions on improvements to your infrastructure.
- CON: Problems with public networks and computers being used to generate load are more likely to disrupt the generated load during the test.
As always, the right infrastructure to use depends on your needs and what you want to accomplish with the load test. Probably the three most important considerations of a load test are:
- Generating consistent levels of load during the test – First, you want to make sure the level is correct. It is a waste of money to apply 400,000 concurrent users during a test if you know from preliminary tests that the server is probably only capable of handling 40,000 concurrent users. This is where it really helps to have support or use a third party for load testing, because they have experience with load tests and with scoping their parameters (the amount of load being the most critical). Generating too little load is also a problem (for similar reasons). In addition, you do not want to have a disruption in the load during the test. A consistent ramp-up is critical for a successful test (see the sketch after this list). If your load generation network fails halfway through the test you will lose valuable time and resources re-running the test.
- The correct emulation of traffic – This is more a matter of taste, but it has to do with how accurately you want your generated load to reflect your actual end users. In load testing this is generally not as critical as it is in performance monitoring. For example, during a load test it is overkill to require that the emulated users reflect a real browser (say IE 7) because you’re really only concerned with generating the load (not how it’s displayed in a browser). If your needs require that the end-user perspective be accurately represented then you will want to consider more expensive solutions that do that. Remember, the primary objective of a load test is to test the server (not the client).
- Define what data is important – Make sure that you know before the test what data you expect to get and how you will utilize it. For example, having information about your CPU, bandwidth, and memory utilization can help you tune your systems for improved performance. However, knowing that an ISP for clients in China is slow or unavailable isn’t going to help you improve your performance (and it could potentially disrupt the flow of concurrent users during the test).
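On the first point, here is a minimal sketch of a linear ramp-up schedule, assuming a target concurrency scoped well below a ceiling suggested by preliminary tests; the numbers and the helper name are illustrative and not tied to any particular load testing tool:

```python
def ramp_up_schedule(target_users: int, ramp_minutes: int, steps: int):
    """Return (minute, concurrent_users) pairs for a steady, linear ramp.

    A consistent ramp matters: if load generation drops out halfway through,
    the test has to be re-run, so the schedule should be planned up front.
    """
    return [
        (round(i * ramp_minutes / steps), round(i * target_users / steps))
        for i in range(1, steps + 1)
    ]

# Preliminary tests suggested ~40,000 concurrent users is the realistic ceiling,
# so scope the test near that level rather than an arbitrary 400,000.
for minute, users in ramp_up_schedule(target_users=40_000, ramp_minutes=30, steps=6):
    print(f"t+{minute:>2} min: {users:,} concurrent users")
```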
When talking about web applications, the Eco-System is a term used to describe the ever-increasing complexity and dispersion of a web application (as it’s delivered to the users). An eco-system approach moves away from traditional web development methodologies, such as managing all software and hardware in-house, and instead adopts development concepts from OOP (code reuse being the biggest). This change in behavior has made web development far more flexible (and reduced time to market) but has dramatically complicated the management of the resulting application (in particular performance monitoring and SLA reporting). With various components and pieces of functionality scattered across the Internet, there needs to be a common platform for performance monitoring and reporting on SLAs. The Webmetrics eco-system management platform does just this, and it does it by considering these main points:
- Monitoring should not be duplicated – Just like code/functionality, you do not want to duplicate your efforts. Multiple departments these days need performance data (IT, QA, Marketing, Executives) and they all need to be on the same page. A common misconfiguration is to have multiple services monitoring the same resource but with different test parameters. The problem here is that if alerts and errors are not standardized then there is no way to build consensus among different departments. Worse is if these values are being used to support an SLA. If the company isn’t consistent about how the SLA is calculated then the company as a whole cannot (and will not) meet the SLA. Eco-system management makes it easy to set up a single set of monitoring services that can then be shared among different groups in the organization.
- Thoroughly monitor web applications – You lose control as more and more functionality moves outside your firewall. Your best bet is to not ignore these components when you’re monitoring. With applications mashed-up from disparate components you need to make sure that not only is your application as a whole monitored, but all the bits and pieces that make it up are monitored as well. This will help you locate and diagnose problems faster. The Webmetrics eco-system does this by allowing you to monitor your web service calls (from the perspective of your network) and view their performance side-by-side with your other monitoring data.
- Keep your third parties honest – If Amazon says that EC2 is available 99.999% of the time, what is your guarantee that at the end of the month (or year) it was up that amount of time? Outages can be subjective (5 minutes of downtime can feel like an eternity), and if you are put in a position to question an SLA you have with a vendor it’s best to have some data of your own to bring to the table (not just what your vendor brings); see the sketch after this list. The data can also be used to help you negotiate with your business partners at the end of the year (or contract). The Webmetrics eco-system does this by giving you access to comprehensive reports on performance and availability (uptime).
- Build trust with users – Your users look at you the same way…they want to know that they’re getting what they pay for. A lot of companies are starting to see the benefit of building trust with users through behaviors that promote transparency of uptime. The Webmetrics eco-system allows you to extend sharing of data to your users, empowering them to keep track of the performance of the services you’re offering them. I suppose that’s good or bad depending on the stability of your systems. For more information on transparency check out this blog.
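As a back-of-the-envelope illustration of the third point, here is a small sketch (with hypothetical numbers, not a Webmetrics report) that turns your own monitoring checks into a measured uptime figure you can compare against a vendor’s stated SLA:

```python
def measured_uptime(total_checks: int, failed_checks: int) -> float:
    """Uptime percentage as observed by your own monitoring checks."""
    return 100.0 * (total_checks - failed_checks) / total_checks

# Hypothetical month: one check per minute, 45 failed checks (~45 minutes down).
total = 31 * 24 * 60
failed = 45
observed = measured_uptime(total, failed)
sla_target = 99.999  # the vendor's stated availability

print(f"Observed uptime: {observed:.3f}% (SLA target {sla_target}%)")
if observed < sla_target:
    print("Bring this data to the table when you question the SLA.")
```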