The PerfMon Blog

November 12, 2008

How to define a monitoring transaction

The first step in monitoring the performance of an online asset is to determine what really needs to be monitored.  This is usually pretty easy if you’re just looking to monitor the availability and performance of a website.  In this case you just need to monitor a single URL (or a subset of the most important URLs) and not every URL on the site (since all web pages are most likely hosted on the same infrastructure).  Focusing your attention on the essentials in this way can help reduce costs.  For example, if it costs $10 to monitor a single URL and you have 1000 pages on your site, that’s $10,000 you have to cough up to monitor every page.  But the return on each additional page over a certain threshold is less and less (i.e. the core value you’re getting from the monitoring is knowing that the site is up and running, and if 10 pages are not available it’s more than likely that all pages are not available).  What about the case where you want to monitor a complex web application?

That case requires similar care, because the more you monitor the more it costs.  However, there is a bit more up-front analysis that you have to do in order to determine the scope of what is to be monitored.  In order to monitor a web-based application, a script is required to tell the monitoring platform what steps to perform and how to interact with the application.  The steps that make up that script constitute a Transaction and directly impact the cost of monitoring (most monitoring solutions use a metered billing approach that charges a single credit per step of a transaction).  So there is a direct correlation between how much a monitoring solution costs and the scope of that monitoring (as the number of steps goes up, so does the price).  It’s important to note that even if metered billing isn’t used, most monitoring platforms still raise the price of the solution as the number of steps increases (e.g. $100 for 1 to 3 steps, $200 for 4 to 6 steps, etc.).  It’s just a fact of life: those cost increases are sometimes to recoup the processing power used to perform the steps, but mostly the increase is to offset the costs of storing, backing up, and reporting data.  So, how does one save money?  One defines a transaction as only the steps necessary to test functionality.  Often the following mistakes are made:

  • Monitoring the mundane (or for the wrong reasons).
  • Monitor duplicate functionality (too much).
  • Monitor too little (taking these recommendations too far).
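
Before digging into each mistake, here is a rough Python sketch of the pricing arithmetic described above.  The function names are mine, and the tiered model assumes the "$100 for 1 to 3 steps, $200 for 4 to 6 steps" pattern simply keeps going in blocks of three, which a real vendor may not do.

```python
import math


def per_url_cost(pages: int, cost_per_url: int = 10) -> int:
    """Flat pricing: every monitored URL costs the same amount."""
    return pages * cost_per_url


def metered_credits(steps: int, runs: int = 1) -> int:
    """Metered billing: one credit per step, for each run of the transaction."""
    return steps * runs


def tiered_price(steps: int, tier_size: int = 3, price_per_tier: int = 100) -> int:
    """Tiered pricing: $100 for 1-3 steps, $200 for 4-6 steps, and so on.

    Assumes the tiers continue in equal blocks of `tier_size` steps.
    """
    if steps < 1:
        raise ValueError("a transaction needs at least one step")
    return math.ceil(steps / tier_size) * price_per_tier


if __name__ == "__main__":
    # Monitoring every page vs. a handful of important ones.
    print("all 1000 pages:", per_url_cost(1000))
    print("10 key pages:  ", per_url_cost(10))
    # Trimming a transaction from 6 steps to 3 drops it a full tier.
    print("6-step script: ", tiered_price(6))
    print("3-step script: ", tiered_price(3))
```

Either way the lever is the same: fewer URLs or fewer steps means a smaller bill, so the question becomes which steps you can drop without losing useful data.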

Let’s look at each of these situations and see how they impact costs as well as how they affect the bottom line (collecting actionable monitoring data).  In each case we will consider only monitoring of transactions:

Monitoring the mundane – This is generally the product of an organization that hasn’t thoroughly thought out the goals of monitoring.  A transaction that I would consider mundane is one that doesn’t really have an end goal and just meanders around the website.  For example, a transaction that clicks through each menu item in the left nav bar is probably mundane.  Sure, there’s an argument for doing that: maybe lots of revenue is generated from the left nav bar, or maybe that’s the only navigation available for the site.  But in actuality this is really a QA problem and should be addressed as such.  It’s common in the field of computer science that the later a problem is discovered the more it’s going to cost to fix it.  That is definitely the case here: a QA process that occurs right after development could have caught any broken links or JavaScript funkiness more efficiently than a costly monitoring solution after code has been deployed.

Monitor duplicate functionality – Sometimes this is a hard one to get around.  But basically you need to make sure that your monitoring transactions are mutually exclusive.  Don’t monitor the updating of a web-based calendar in two separate transactions when one will do.  Another case is when similar methods are invoked in a single transaction.  For example, if you have a tool that configures a product and does so in 20 steps, it’s probably overkill to perform all configurations (since they all probably access the same front-end and back-end functionality).  Have the transaction perform a couple of configurations and then complete the transaction (i.e. purchase, or whatever the result of the transaction is).  In this last case the duplicate functionality is a bit obscure: on the front-end the functionality looks different (configure a tire vs. configure a stereo), but on the back-end the functionality is more than likely the same (accessing the same database through the same web service).  Therefore all you’re really doing is testing more client-side code execution (which, again, probably should have been done during the QA process).
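
One way to spot overlap between transactions is to list the steps of each script and flag any step that shows up more than once.  This is only a sketch: the transaction and step names below are made up, and a real monitoring platform will have its own script format.

```python
from collections import defaultdict


def shared_steps(transactions: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each step that appears in more than one transaction
    to the sorted names of the transactions that contain it."""
    seen: defaultdict[str, set[str]] = defaultdict(set)
    for name, steps in transactions.items():
        for step in steps:
            seen[step].add(name)
    return {step: sorted(names) for step, names in seen.items() if len(names) > 1}


if __name__ == "__main__":
    # Hypothetical scripts for a web-based calendar.
    transactions = {
        "update_calendar": ["login", "open_calendar", "edit_event", "save_event"],
        "invite_guest": ["login", "open_calendar", "edit_event", "add_guest", "save_event"],
    }
    for step, names in shared_steps(transactions).items():
        print(f"{step}: exercised by {', '.join(names)}")
```

Some overlap (a login step, for instance) is unavoidable.  The point is to notice when two transactions share nearly everything, which suggests one of them could be dropped or merged into the other.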

Monitor too little – If you start to get too carried away with the recommendations I’ve made you could end up shooting yourself in the foot.  For example, in the last section I gave the example of an application that configures a product and uses the same back-end technology for each step of the configuration.  But it very well could be the case that third-party functionality is embedded in the configuration tool (one step could be hosted by you while another step makes a request to a third party).  In that case maybe it does make more sense to monitor additional steps (though the third-party content could probably be monitored more efficiently by breaking it out into its own monitoring service).  The end result is that you’re looking for efficiencies in monitoring that will help reduce cost while NOT altering the data set you expect to get from the monitoring.

To summarize, you want to focus your monitoring so that you can achieve your goals in improving performance without creating convoluted and expensive data sets.  Also, you want to avoid getting too zealous with efficiency and stripping your data set of all its value.

November 5, 2008

Blog Performance

I’ve been receiving a lot of alerts from my monitoring system for this blog so I thought I’d do a bit of investigation.  I checked my monitoring data to see if there was a trend in performance degradation and sure enough I found it in a graph that lays out the page load time and errors.  Here’s what it looks like:

Performance Metrics


It doesn’t show any major overall increase in page load time (that is, a left-to-right rise in the dark green line), but I do notice that errors have started to increase, culminating in 31 errors on Tuesday (11/4).  Though I don’t have any analytics data to prove this, I’m pretty sure it had to do with all the blogging (and blog reading) that was going on nationwide with the presidential election.  All that traffic on WordPress certainly would have affected my blog’s performance (and that of every other blog on the same hardware).  It’s not just WordPress: other sites carrying news about the election were hammered with users looking for election results (here’s an article on Akamai’s traffic for yesterday).

When I drilled down into the monitoring results I found that the errors I was experiencing were timeout errors.  In my monitoring software I’ve established a performance threshold of 3 seconds.  That means that my page and all its content must be completely downloaded by the browser (IE7) within 3 seconds.  When I established this threshold I had been monitoring the performance of the page for a week and saw that the average performance was around 1.8 seconds.  Two factors are making me reconsider that threshold, and they are factors that everyone providing a web service should think about and adapt to:

  1. Content (particularly in the world of blogging) will be added to a web page/application over time.  For example, this blog has roughly doubled in size (when I first started blogging the page download was 177951 bytes, now it’s 366514 bytes).  This is going to affect the performance of the resource (another example: adding Ajax to a website will require more JavaScript that will need to be downloaded).  Future growth needs to be considered when setting expectations for performance.
  2. Outside factors will always influence your performance; that’s why external monitoring is a must for web resources.  I should have seen it coming a mile away.  This is a classic example of current events or major news drawing hordes of web users to servers that can barely contain them (ok, ok…too dramatic; in all reality it doesn’t look like the WordPress servers were in any danger of failing).  Ultimately it brings up a very important issue: know which shared resources in your application can make you vulnerable to other tenants’ traffic.
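
To put rough numbers on both points, here is a small Python sketch.  The 3-second threshold and the two page sizes come from this post; the list of load times is invented purely for illustration, and the function names are my own.

```python
THRESHOLD_SECONDS = 3.0   # timeout threshold configured in my monitoring
INITIAL_BYTES = 177_951   # page size when the blog started
CURRENT_BYTES = 366_514   # page size now


def count_timeouts(load_times: list[float], threshold: float = THRESHOLD_SECONDS) -> int:
    """Count the samples whose full page load exceeded the threshold."""
    return sum(1 for seconds in load_times if seconds > threshold)


def growth_factor(initial_bytes: int, current_bytes: int) -> float:
    """How many times larger the page is now than at the start."""
    return current_bytes / initial_bytes


if __name__ == "__main__":
    # Hypothetical load times (seconds) from a handful of monitoring runs.
    samples = [1.7, 1.9, 3.4, 2.2, 4.1]
    print("timeouts:", count_timeouts(samples))
    print(f"page growth: {growth_factor(INITIAL_BYTES, CURRENT_BYTES):.2f}x")
```

A threshold chosen against a 1.8 second average leaves some headroom, but that headroom shrinks as the page grows and as shared infrastructure gets busy, so it is worth re-running this kind of check from time to time.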

These considerations will make your SLA (or any agreement you may have with users of your application) more robust.  It is definitely not recommended to continually adjust your SLA.  The SLA is a pact with your users and it’s best to have it remain a strong, unchanging core.  Changing your SLA every week to keep up with performance degradations is counterproductive, and at that point it doesn’t really make sense to even provide an SLA.  So I have two options.  First, I could change my SLA/Timeout value from 3 seconds to 4 or 5 seconds; this would definitely reduce the number of alerts I’m receiving (it would also weaken my stance on what I promise to my readers).  Second, I can keep my Timeout value at 3 seconds and see what happens with the performance of the site (and even proactively improve the performance of the site).  At this point I’m not too concerned about the alerts, and I think that as the election hype dies down I will see a reduction in alerting (which has been the case over the last 24 hours).  It may not even be necessary at this point to make any improvements to the site, though it’s certainly something that will need to be considered in the future as I continue to post content.
