The PerfMon Blog

July 23, 2009

The Cost of Poor Performance

Phil Dixon recently gave a great talk at Velocity 2009 about the cost of poor performance. Through a series of rigorous tests, measurements and reports, Shopzilla was able to quantify the cost of performance to the bottom line of the business. Performance monitoring is often thought of as insurance: you pay a small premium every month so that you are notified when the site is down. But it is increasingly becoming much more than this. With the variety of tools and APIs now available on multiple platforms, it’s possible (though this does require a good amount of effort) to tie performance and profitability directly together. Another talk at Velocity argued that the future of performance monitoring is really about being able to translate performance and scalability costs into their direct impact on the business. In this model, the idea is to create a curve that plots the costs of scalability and performance (hardware, engineering, etc.) against the impact to the bottom line (revenue, direct sales, etc.). Alistair Croll and Sean Power’s book covers these topics in much more depth.

June 30, 2009

Script Recorder Best Practices

Filed under: Monitoring Best Practices | Tyler Fullerton @ 11:39 am

Not a major component of performance monitoring, but an important one nonetheless, is the Script Recorder.  The Script Recorder is a utility that generates the script used to automate a transaction on a monitoring platform.

Generally this is a downloaded application that has two modes, record and playback (a co-worker of mine often compares it to the macro recording functionality in the Microsoft Office suite of products).  You essentially interact with the Script Recorder as though it is a browser and your actions are recorded in the background.  The playback mode will take the recorded script (or a previously recorded script) and process it, allowing the Script Recorder to play back the steps that were taken in a browser.  Playback is intended for verification.

Very rarely does the Script Recorder become a crucial need with performance monitoring.  The goal is to generate scripts, and if there is an easier way to do that without a Script Recorder then it’s irrelevant whether a Script Recorder even exists.  That said, some power users prefer to have a Script Recorder available if their scripts need constant updating, if they want to perform some low level QA, or if they just want more control over the scripting process.

To help in the process of researching Script Recorders I have identified some requirements that are based off of monitoring best practices.  Remember though, the Script Recorder is only a byproduct of the monitoring platform and should not be the primary deciding factor when looking for a monitoring solution.  Here are some desirable requirements of a Script Recorder:

  1. Flexibility – Script Recorder should be easy to use and allow for the ability to enter custom code/steps/instructions by hand if necessary.  The creators of a Script Recorder generally do not account for all the problems that someone may encounter when scripting, therefore it is beneficial to not limit script generation to just what the browser itself can do.  Allowing the user to enter in custom instructions can be helpful particularly when trying to clean-up before or after a script is run.
  2. Fidelity – Record functionality should be capable of recording asynchronous (and non-interactive) events such as Ajax/JavaScript.  These events are usually invisible to a user and therefore to monitoring.  If a Script Recorder is worth its salt then it should be able to track these events.
  3. Completeness – Playback functionality should be capable of playing everything back.  The monitoring platform will have to play it back, so if there is a major discrepancy in playback between the Script Recorder and the actual monitoring platform then that is a sign that things are not all copacetic.
  4. Openness – Ideally an open source platform is available.  The script generated by the Script Recorder should be easily portable (open-source) and easy to understand.  Immediately you have the support of an open source development community.  If your script is open source then you’re not tied down to the implementation of the Script Recorder vendor.  This will make it easier for you to up and leave if your needs change (or if you just don’t like your vendor!).
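To make the flexibility, completeness, and openness points a little more concrete, here is a minimal sketch of what a portable, hand-editable script could look like. The step format (`action`, `target`, `value`), the `custom` action, and both helper functions are assumptions invented for illustration; they do not describe any vendor's Script Recorder.

```python
import json

# Hypothetical recorded script: each step is plain data, so it can be
# read, edited by hand, and moved between tools.
recorded_steps = [
    {"action": "open", "target": "https://example.com/login"},
    {"action": "type", "target": "#username", "value": "demo"},
    {"action": "click", "target": "#submit"},
]


def add_custom_step(steps, position, step):
    """Return a new script with a hand-written step inserted at position."""
    return steps[:position] + [step] + steps[position:]


def playback_discrepancies(recorder_results, platform_results):
    """List the step indexes where recorder and platform playback disagree."""
    return [
        index
        for index, (a, b) in enumerate(zip(recorder_results, platform_results))
        if a != b
    ]


# Flexibility: append a clean-up step by hand after the recorded steps.
script = add_custom_step(
    recorded_steps,
    len(recorded_steps),
    {"action": "custom", "target": "clear_cookies"},
)

# Openness: the script survives a round trip through a plain-text format.
portable = json.loads(json.dumps(script))

# Completeness: compare per-step outcomes from the two playback environments.
mismatches = playback_discrepancies(["ok", "ok", "fail", "ok"], ["ok", "ok", "ok", "ok"])
print(f"{len(portable)} steps, mismatching step indexes: {mismatches}")
```

A non-empty list of mismatches is the kind of discrepancy the Completeness item warns about.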

May 27, 2009

Mashing up data

Filed under: Monitoring Best Practices, Performance Monitoring | Tyler Fullerton @ 2:49 pm

As I was digging through some old documents I came across some screenshots from a mashup application that a co-worker of mine had created.  The mashup combined analytics data from Google Analytics with performance monitoring data from Webmetrics.  Here is a screenshot from the application:

perf-a-lytics mashup


The graph shows the performance of the application (the dark blue line) graphed against the number of page views (the light green bars).  Normally, either of these data sets would be good data in their own right.  But each set of data alone leaves certain questions unanswered.

For example, if we consider only the analytics data we have answered the question: How many users are on my site at any given time? But a very important follow-up question remains unanswered: What impact does that have on my site’s performance? The addition of the performance data answers that second question.  The performance data shows us that there is a considerable increase in page load time (roughly 4x more time to load the same page).

What about if we consider the performance monitoring data alone?  In this case we answer the question: What is the performance of my application at any given time? We can tell what our performance is, but the question that goes unanswered is: What impact does that have on how people use/access my site? The answer that we get by adding the analytics data is that people tend to leave the site when the performance degrades.  That’s exactly what we see in this graph: as the users leave the site (most likely upset) the site calms down because of fewer requests.  One other thing to note is that we also see the cause of the performance degradation.  We see that the increase in users has resulted in an increase in page load time (because the server has to use the same amount of resources to fulfill more requests).

This is just a small example of what can be done by combining data sets.  I would recommend that whenever you look at gathering data with a tool (whether it be analytics, performance data, etc.) you should make sure that an API for accessing the data is available.  This will allow you to get more value out of your tool and also will open new doors to you as you add tools in the future.
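As a rough sketch of the kind of mashup described above, the snippet below joins two hourly data sets and measures how closely page views and load time move together. The numbers are made up for illustration, and the dictionaries stand in for whatever each tool's API would return; nothing here reflects the Google Analytics or Webmetrics APIs.

```python
from statistics import mean

# Hypothetical hourly exports from two separate tools, keyed by hour of day.
page_views = {9: 1200, 10: 2500, 11: 4800, 12: 5100, 13: 2300}
load_time_seconds = {9: 1.1, 10: 1.6, 11: 4.2, 12: 4.5, 13: 1.5}


def mash_up(views, load_times):
    """Join the two data sets on the hours present in both."""
    shared_hours = sorted(views.keys() & load_times.keys())
    return [(hour, views[hour], load_times[hour]) for hour in shared_hours]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    variance_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / (variance_x * variance_y) ** 0.5


rows = mash_up(page_views, load_time_seconds)
correlation = pearson([row[1] for row in rows], [row[2] for row in rows])

for hour, views, seconds in rows:
    print(f"{hour:02d}:00  views={views:5d}  load={seconds:.1f}s")
print(f"correlation between views and load time: {correlation:.2f}")
```

A correlation close to 1 on data like this would support the reading that load time climbs with traffic, though it says nothing on its own about which one drives the other.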

Sample mashup courtesy of Lenny.

May 12, 2009

The Value of Monitoring

As I was sitting at my desk this morning processing email I was amazed at the number of questions I was getting about basic value propositions of external performance monitoring.  I thought maybe it’s not as clear as I suspect it is. So I took off my Engineering hat (which often causes me to be derisive and not understanding of the lack of knowledge) and instead put on my Sales hat (which makes me excited to share information).  So here it is, a quick summary of the value of external performance monitoring:

  • Alerting – This should always be first.  A good portion of the value in performance monitoring comes from being able to evaluate the data collected.  But the most valuable functionality comes from being alerted proactively as problems are occurring.  This allows you to reduce downtime, meet SLAs, and investigate problems as they’re occurring.  Central to this is a monitoring frequency that will help accomplish this.  If you only monitor once an hour then there is going to be some noticeable lag in how quickly you can respond to a problem.  If you monitor more frequently (ex: every 5 minutes) then you can react relatively quickly.  And even though every 5 minutes is the industry standard, monitoring once every minute can be of even more value for your mission critical applications.
  • Reporting – This is key.  Outside of alerting, all that you have at the end of the day is a set of data, and being able to draw conclusions from this data and make educated decisions about the performance of your site/application is key.  Your data set needs to be robust because you cannot go back after the fact and collect more data if you feel your data set is too sparse (you can adjust your settings to give you more data in the future but as for the past, that opportunity to collect data has come and gone).  Just like alerting, reporting requires a certain interval.  For example, you would not want to collect data once an hour and then use that data set to make adjustments to your site.
  • Monitoring Client capabilities – This one is a bit more abstract but can be summarized fairly easily.  The monitoring client is what performs the monitoring, collects the data, and perceives your site the way a user would.  Normally such a client is implemented as either an emulation of a browser (a program that makes HTTP requests) or an actual browser that has a framework around it to automate interactions.  The actual browser implementation of this client is the most desired solution because it’s the closest representation of what your end users are experiencing (thus it indirectly makes your data set all the more accurate).  The emulated browser is generally a better fit for the cost conscious monitoring consumer as it’s only an approximation of what your end user experiences (of the deficiencies of an emulated browser, lack of JavaScript support is probably the most egregious).

So there it is…when you’re shopping around for a monitoring solution you should consider the above three values.  You can then take those and sort them in order of importance for the objectives you plan to achieve with monitoring.  I generally consider the monitoring client to be the most important factor because it generates the data that is going to be used by all other monitoring components (the data collected is more accurate and representative of your users which in turn makes your reports more valuable and makes your alerts more accurate).
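The effect of monitoring frequency on both alerting and reporting can be put into rough numbers. This is a back-of-the-envelope sketch that assumes a single monitoring location, checks that complete instantly, and an alert that fires on the first failed check; real platforms differ on all three.

```python
MINUTES_PER_DAY = 24 * 60


def worst_case_detection_lag(interval_minutes):
    """An outage that starts just after a check is not seen until the next check."""
    return interval_minutes


def samples_per_day(interval_minutes):
    """Number of data points one monitoring location collects per day."""
    return MINUTES_PER_DAY // interval_minutes


for interval in (60, 5, 1):
    print(
        f"every {interval:2d} min: up to {worst_case_detection_lag(interval)} min "
        f"before an outage is seen, {samples_per_day(interval)} samples/day"
    )
```

Going from hourly to 5 minute checks shrinks the worst-case blind spot from an hour to 5 minutes and grows the daily data set from 24 samples to 288, which is why the same interval matters for both alerting and reporting.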

November 12, 2008

How to define a monitoring transaction

The first step in monitoring the performance of an online asset is to determine what really needs to be monitored.  This is usually pretty easy if you’re just looking to monitor the availability and performance of a website.  In this case you just need to monitor a single URL (or a subset of the most important URLs) and not every URL on the site (since all web pages are most likely hosted on the same infrastructure).  Focusing your attention on the essentials in this way can help reduce costs.  For example, if it costs $10 to monitor a single URL and you have 1000 pages on your site, that’s $10,000 you have to cough up to monitor every page.  But the return on each page over a certain threshold is less and less (i.e. the core value you’re getting from the monitoring is that the site is up and running, and if 10 pages are not available it’s more than likely that all pages are not available).  What about in the case where you want to monitor a complex web application?

It requires a similar concern because the more you monitor the more it costs.  However, in this case there is a bit more up front analysis that you have to do in order to determine the scope of what is to be monitored.  In order to monitor a web based application a script is required to tell the monitoring platform what steps to perform and how to interact with the application.  The steps that make up that script constitute a Transaction and directly impact the cost of monitoring (most monitoring solutions will use a metered billing approach which uses a single credit per step of a transaction).  Now we can see that there is a direct correlation between how much a monitoring solution costs and the scope of that monitoring (number of steps goes up, so does the price).  It’s important to note that even if metered billing isn’t used, most monitoring platforms have a concept of increasing the price of the solution as the number of steps increases (ex: $100 for 1 to 3 steps, $200 for 4 to 6 steps, etc.).  It’s just a fact of life: those cost increases are sometimes to recoup the processing power of performing the steps, but mostly the increase is to offset the costs of storing, backing up, and reporting data.  So, how does one save money?  One defines a transaction as only the steps necessary to test functionality.  Often the following mistakes are made:

  • Monitoring the mundane (or for the wrong reasons).
  • Monitor duplicate functionality (too much).
  • Monitor too little (taking these recommendations too far).

Let’s look at each of these situations and see how they impact costs as well as how they affect the bottom line (collecting actionable monitoring data).  In each case we will consider only monitoring of transactions:

Monitoring the mundane – This is generally the product of an organization that hasn’t thoroughly thought out the goals of monitoring.  A transaction that I would consider mundane is one that doesn’t really have an end goal and just meanders around the website.  For example, a transaction that clicks through each menu item in the left nav bar is probably mundane.  Sure there’s an argument for why to do that, maybe lots of revenue is generated from the left nav bar, or maybe that’s the only navigation available for the site.  But in actuality this is really a QA problem and should be addressed as such.  It’s common in the field of computer science that the later a problem is discovered the more it’s going to cost to fix it.  That is definitely the case here: a QA process that occurs right after development could have caught any broken links or JavaScript funkiness more efficiently than a costly monitoring solution after code has been deployed.

Monitor duplicate functionality – Sometimes this is a hard one to get around.  But basically you need to make sure that your monitoring transactions are mutually exclusive.  Don’t monitor the updating of a web based calendar in two separate transactions when one will do.  Another case is when similar methods are invoked in a single transaction.  For example, if you have a tool that configures a product and does so in 20 steps it’s probably overkill to perform all configurations (since they all probably access the same front-end and back-end functionality).  Have the transaction perform a couple of configurations and then complete the transaction (i.e. purchase, or whatever the result of the transaction is).  In this last case the duplicate functionality is a bit obscure…on the front-end the functionality looks different (configure a tire vs configure a stereo) but on the back-end the functionality is more than likely the same (accessing the same database through the same web service) and therefore all you’re really doing is testing more client side code execution (which probably should have been done during the QA process again).

Monitor too little – If you start to get too carried away with the recommendations I’ve made you could end up shooting yourself in the foot.  For example, in the last section I gave the example of an application that configures a product and uses the same back-end technology for each step of the configuration.  But it very well could be the case that third party functionality is embedded in the configuration tool (one step could be hosted by you while another step makes a request to a third party).  In that case maybe it does make more sense to monitor additional steps (though it could probably be monitored more efficiently by breaking out that third party content monitoring into its own monitoring service).  The end result is that you’re looking for efficiencies in monitoring that will help reduce cost while NOT altering the data set you expect to get from the monitoring.

To summarize, you want to focus your monitoring so that you can achieve your goals in improving performance without creating convoluted and expensive data sets.  Also, you want to be careful not to get too zealous with efficiency and strip your data set of all its value.
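The cost arithmetic in this post can be sketched in a few lines. The $10 per URL and the $100 per block of 3 steps come from the examples above; extending the tier pattern beyond 6 steps and trimming the 20 step configurator to 6 steps are assumptions for illustration, not any vendor's actual pricing.

```python
def per_url_cost(pages, price_per_url):
    """Cost of monitoring every page individually."""
    return pages * price_per_url


def tiered_transaction_cost(steps, steps_per_tier=3, price_per_tier=100):
    """$100 for 1 to 3 steps, $200 for 4 to 6 steps, and so on."""
    if steps < 1:
        raise ValueError("a transaction needs at least one step")
    tiers = -(-steps // steps_per_tier)  # ceiling division
    return tiers * price_per_tier


every_page = per_url_cost(1000, 10)
key_pages = per_url_cost(10, 10)
full_configurator = tiered_transaction_cost(20)
trimmed_configurator = tiered_transaction_cost(6)

print(f"all 1000 pages: ${every_page}, 10 key pages: ${key_pages}")
print(f"20 step transaction: ${full_configurator}, 6 step transaction: ${trimmed_configurator}")
```

The saving only counts if the trimmed transaction still exercises everything you care about, which is the point of the "monitor too little" warning.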

November 5, 2008

Blog Performance

I’ve been receiving a lot of alerts from my monitoring system for this blog so I thought I’d do a bit of investigation.  I checked my monitoring data to see if there was a trend in performance degradation and sure enough I found it in a graph that lays out the page load time and errors.  Here’s what it looks like:

Performance Metrics


It doesn’t show any major overall increase in load time (that is, a left to right increase in the value of the dark green line) but I do notice that errors have started to increase, culminating in 31 errors on Tuesday (11/4).  Though I don’t have any analytics data to prove this, I’m pretty sure it had to do with all the blogging (and blog reading) that was going on nationwide with the presidential election.  All that traffic on WordPress certainly would have affected my blog’s performance (or at least mine and every other blog that is on the same hardware).  It’s not just WordPress; other sites containing news about the election were hammered with users looking for election results (here’s an article on Akamai’s traffic for yesterday).

When I drilled down into the monitoring results I found that the errors I was experiencing were timeout errors.  In my monitoring software I’ve established a threshold for performance of 3 seconds.  That means that my page and all its content must be completely downloaded by the browser (IE7) within 3 seconds.  When I established this threshold I had been monitoring the performance of the page for a week and saw that the average performance was around 1.8 seconds.  Two factors are making me reconsider that threshold, and these are two factors that everyone providing a web service should think about and adapt to:

  1. Content (particularly in the world of blogging) will be added to a web page/application over time.  For example, this blog has more than doubled in size (when I first started blogging the page download was 177951 bytes, now it’s 366514 bytes).  This is going to affect the performance of the resource (another example, adding Ajax to a website will require more JavaScript that will need to be downloaded).  Future growth needs to be considered when setting expectations for performance.
  2. Outside factors will always influence your performance, that’s why external monitoring is a must for web resources.  I should have seen it coming a mile away: this is a classic example of current events or major news events drawing hordes of web users to servers that can barely contain them (ok, ok…too dramatic, in all reality it doesn’t look like the WordPress servers were in any danger of failing).  Ultimately it brings up a very important issue.  Know what shared resources in your application can make you vulnerable to other tenants’ traffic.

These considerations will make your SLA (or any agreement you may have with users of your application) stronger and more robust.  It is definitely not recommended to continually adjust your SLA.  The SLA is a pact with your users and it’s best to have it remain a strong/unchangeable core.  Changing your SLA every week to keep up with performance degradations is counterproductive and at that point it doesn’t really make sense to even provide an SLA.  So I have two options.  First, I could change my SLA/Timeout value from 3 seconds to 4 or 5 seconds; this would definitely reduce the number of alerts I’m receiving (it would also weaken my stance on what I promise to my readers).  Second, I can keep my Timeout value at 3 seconds and see what happens with the performance of the site (and even proactively improve the performance of the site).  At this point I’m not too concerned about the alerts and I think that as the election hype dies down I will see a reduction in alerting (which is the case over the last 24 hours).  It may not even be necessary at this point to make any improvements to the site, though it’s certainly something that will need to be considered in the future as I continue to post content.
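The threshold reasoning above can be sketched as a quick check. The 3 second timeout, the 1.8 second average, and the two page sizes are the figures quoted in this post; the list of samples is hypothetical and only stands in for real monitoring data.

```python
TIMEOUT_SECONDS = 3.0          # threshold quoted in the post
BASELINE_AVERAGE_SECONDS = 1.8  # average observed when the threshold was set
INITIAL_PAGE_BYTES = 177951     # page size when the blog started
CURRENT_PAGE_BYTES = 366514     # page size now

# Hypothetical page load samples in seconds; not real monitoring data.
samples = [1.7, 1.9, 2.4, 3.2, 1.8, 3.6, 2.9]


def timeout_errors(load_times, timeout):
    """Samples that would be reported as timeout errors."""
    return [seconds for seconds in load_times if seconds > timeout]


errors = timeout_errors(samples, TIMEOUT_SECONDS)
headroom = TIMEOUT_SECONDS - BASELINE_AVERAGE_SECONDS
growth = CURRENT_PAGE_BYTES / INITIAL_PAGE_BYTES

print(f"{len(errors)} of {len(samples)} samples exceed {TIMEOUT_SECONDS:.0f}s")
print(f"headroom over the original average: {headroom:.1f}s")
print(f"page weight has grown {growth:.2f}x")
```

Raising the timeout to 4 or 5 seconds would make the hypothetical errors disappear from the report without making the page any faster, which is the trade-off weighed above.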

October 23, 2008

5 Key Graphs

Filed under: Monitoring Best Practices, Performance Monitoring | Tyler Fullerton @ 11:32 am

Over the past week I’ve been developing a presentation/demonstration for training people on performance monitoring using the Webmetrics GlobalWatch platform for monitoring.  As part of that training I identified 5 key graphs that prove to be invaluable when analyzing performance data.  To present these graphs I developed a slide that would allow my audience to draw by hand the 5 graphs and list the value that each graph provides.  So here is a copy of the slide filled out by me:

5 Key Graphs


To clarify, these graphs are specific to a monitoring service that monitors a multiple step transaction (ex: purchase on Amazon.com, user log-in, or user registration); however, similar graphs should be available for other types of monitoring (stream performance, web service monitor, URL monitor, DNS monitoring).  Basically, I recommend that whenever you look to implement performance monitoring you should make sure that the presentation layer of that monitoring allows you (at least) to view your data in fashions similar to these 5 graphs:

  1. Transaction Average Load Time – A graph that shows you the high level view of the data you are collecting, in this case a view of the average load time throughout the day of performing the transaction (or viewing a single URL).  This acts as an executive summary as well that helps to show basic trends in the performance of the transaction over time.  Optionally it is of value to display any errors that occurred in the graph.  Also, it helps if the graph can easily be drilled-down on so that it does not lock you into only a high level view (the drill-down will allow you to see the individual sample values that make up the displayed average).  Again, the primary benefit of this graph is a high level view that allows you to look for trends in performance.
  2. Transaction Step Averages – Another view at average data, but this time we’re drilling down to the individual steps that make up the transaction.  The example drawing shows a 6 step transaction with errors on steps 1 and 4 (and a performance bottleneck at step 4 as well).  The benefit of this graph is that you can now break down the performance of the steps that make up the transaction being monitored.  However, it’s still an average.  So it’s going to give us a high level view that allows us to identify what steps in a transaction can use improvements in performance as well as break down a complex set of data.  While errors on a per step basis should be an option for the graph, drill-down capabilities would probably be overkill and presentation would be clunky at best.
  3. Transaction Steps Over Time – A graph that shows the average performance over time for each step in the transaction relative to the other steps.  This graph is similar to the first graph discussed but it breaks down the data so that we can look at trending for each individual step (as well as see how performance degradation affects individual steps in the transaction – as opposed to the transaction as a whole).  Errors should again be an optional parameter to the graph, but errors should be distinguished by the step they occurred on since the primary data plotted is per step performance.  This graph would only add value for a service that monitors multiple steps (either a transaction or a number of URLs).
  4. Full-Page Breakdown – Arguably the most important graph that external performance monitoring can generate.  If you’re looking at a monitoring solution that doesn’t provide it (or…gasp: doesn’t even collect full-page data) then you are not getting the true value of external performance monitoring.  The full-page breakdown is a waterfall style graph that shows the download/rendering characteristics of a web page (or pages).  The full-page breakdown displays performance data for every item that makes up a web page (images, CSS, JavaScript files, etc.).  Whenever you record performance of these items you should consider at least the basic performance metrics: HTML download, redirection, network latency, and transfer time.  The full-page breakdown is a great reflection of browser fidelity (which is a representation of how accurately a monitoring solution emulates an actual client – such as a browser).  This is generally the lowest level of granularity you can get with external monitoring and it allows you to see what impact components (JavaScript, images, etc.) have on the overall performance of your web applications and pages.
  5. Uptime & Average Load Time – This graph is central to external monitoring solutions (and would only exist on massively deployed internal solutions).  The focus of this graph is on providing performance metrics on a per location basis.  Since external monitoring is done from global locations outside your firewall you will see different performance for different regions (samples originating further away from your servers will take longer to traverse the Internet).  Monitoring solutions that are deployed in house suffer from proximity between the resource being monitored and the tool that is monitoring.  This graph will show you the performance from different locations (the line drawn across the graph) as well as the uptime from each location (the bars in the background of the graph).  A common usage of this graph is to evaluate the benefits of a CDN.  If a site is not using a CDN you would expect to see a rightward trend in performance improvement (that is, the line representing performance would descend to the right for locations that are closer to the server being monitored).  When a CDN is used you would expect to see a very consistent performance line because accessing a site/resource provided by a CDN will reduce overall load time no matter where the client is accessing the site/resource from (CDN = content delivery/distribution network.  This is a network of servers that push content to the edges of the networks around the globe so that requests for the content don’t have to travel far…thus reducing latency times).

One important note on the last graph.  I mentioned that the Uptime & Average Load Time graph is good for evaluating CDNs; this is true if the external monitoring solution is an impartial third party.  Some monitoring services may be partnered with CDNs such that the CDN refers customers to the monitoring company in exchange for performance metrics that are skewed to show better results than what are actually achieved.  There isn’t anything tricky going on here…it simply has to do with where the monitoring agent is located in comparison to a CDN server that is providing content.  If they’re in the same data center then the performance data collected is somewhat biased and will show better performance improvements than most people will experience in using the CDN.  Definitely check and find out what the context is behind data that you’re using to evaluate CDNs or any other web technology that promises to improve your performance.
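For readers who want to see what sits behind graphs 2 and 5, here is a minimal sketch of the aggregation involved. The sample tuples, the location names, and the definition of uptime as the share of successful samples are all assumptions for illustration; this is not how GlobalWatch or any other platform necessarily computes these values.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (location, step number, load time in seconds, succeeded).
samples = [
    ("Los Angeles", 1, 0.8, True),
    ("Los Angeles", 2, 1.4, True),
    ("London", 1, 1.5, True),
    ("London", 2, 2.6, False),
    ("Tokyo", 1, 1.9, True),
    ("Tokyo", 2, 3.1, True),
]


def step_averages(rows):
    """Graph 2: average load time per transaction step."""
    by_step = defaultdict(list)
    for _, step, seconds, _ in rows:
        by_step[step].append(seconds)
    return {step: mean(values) for step, values in sorted(by_step.items())}


def location_summary(rows):
    """Graph 5: uptime percentage and average load time per location."""
    by_location = defaultdict(list)
    for location, _, seconds, succeeded in rows:
        by_location[location].append((seconds, succeeded))
    return {
        location: {
            "uptime_pct": 100.0 * sum(ok for _, ok in values) / len(values),
            "avg_seconds": mean(seconds for seconds, _ in values),
        }
        for location, values in by_location.items()
    }


for step, average in step_averages(samples).items():
    print(f"step {step}: {average:.2f}s average")
for location, summary in location_summary(samples).items():
    print(f"{location}: {summary['uptime_pct']:.0f}% uptime, {summary['avg_seconds']:.2f}s average")
```

When comparing locations this way, it helps to record where each monitoring agent sits relative to the servers (or CDN nodes) it is measuring, for the reason given above.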

September 30, 2008

The Newbie Introduction (A Recap)

I’ve been blogging on performance monitoring for a couple months now and it dawned on me that the information presented is probably straightforward for someone who has had past experience in performance monitoring (setting up, interacting with, etc.), but for someone new to performance monitoring it may be hard to cobble together a decent understanding of performance monitoring from a bunch of scattered concepts posted on a blog.  My goal with this post is to provide a list of questions that are integral to establishing a functional performance monitoring solution.  These questions are:

  • What problems are being solved?
  • What base are we trying to solve problems for?
  • Who is involved in solving the problems?
  • What information is required to solve the problems?
  • What is the perspective needed to solve the problems?
  • How will problem solving techniques be integrated?
  • Can future (unexpected) problems be solved?

These questions revolve around the idea of solving a problem.  That is, you have a web based resource (website, application, web service, etc.) and something about it is keeping you up at night.  So, let’s go through these one at a time:

  • What problems are being solved? This is the most important question you can ask yourself, but you already knew that :) .  You need to know where you’re at before you can determine if you’re moving in the right direction.  The process of implementing performance monitoring should be broken down into distinct milestones that should be in place before you even start talking to vendors.  For example, suppose your answer to this question is: I want to know where on my site my users are going, where they enter the site, and where they leave the site!  Talking to a performance monitoring vendor about this is only going to infuriate you as they try to sell you monitoring when all you really need is analytics (by the way, there are lots of really well established companies that can help you with analytics, such as: Omniture or Coremetrics).  So know the problems that you want to solve…these are your goals.  Write them down on a piece of paper and give each one a weight (critical, nice to have, not really necessary).  Since the assumption is that you don’t know anything about monitoring you’ll have to be vague.  Problems like: My site is slow, It becomes unavailable a lot and I want to know when that is, my customers complain about performance but I don’t see it, I want to track where my users are going, I want to be able to perform fail over if something happens, I want to have someone else host my site/content, and I want a cup of coffee! are what you need to concentrate on.  They help you know where you need to go and how to get there.  The first three problems can be solved with monitoring.  The fourth, fifth, sixth, and seventh are probably not problems that you would want to solve with monitoring.  So now you’re armed with information that will allow you to successfully navigate the process of picking a vendor.
  • What base are we trying to solve problems for? That is, what exactly do you want to monitor?  In an ideal world you could monitor and collect performance metrics on every page of your site, or every business process in your web application.  The problem is that the cost and management of that solution is prohibitive and the data generated (alerts, logs, and reports) is probably more than any organization would be able to handle.  So it’s clear that you need to distinguish between what you need to monitor and what you don’t need to monitor.  This really depends on your business model and commitments to your customers (and other stakeholders).  If one of the main problems you’re trying to solve (from the previous question) is managing SLA values then you need to consider monitoring only those resources that fall under the SLA.  If many resources fall under the SLA then you could potentially re-tool your SLA to take into consideration that SLA verification will be based on a subset of the services you provide (this depends on how well established your SLA already is and if your clients will allow you to amend your agreement with them).  One important attitude to have during this step is honesty with yourself.  You really need to be honest and make compromises with yourself (and your organization) as to what needs to be monitored, what would be nice to monitor (but not necessary), and finally – what doesn’t need to be monitored.  You may even want to reach out beyond your current needs and ask others (customers, executives, etc.) what they might need performance metrics on.  Those needs may not fit into your current budget or plans but it’s always good to know what the future holds.
  • Who is involved in solving the problems? Gather the people who are going to help you make a decision. If you’re the head of an IT department, then you’re going to want to poll your customer base (whether internal or external) and find out what they’re asking for. Are they even concerned with performance? If they are, how apparent are performance issues to them? Also, find others within your organization who can help you evaluate the performance monitoring services/tools that you will be looking at. Marketing, for example, may be able to make great use of the data that is collected (by the way, marketing departments are almost always beneficiaries of performance monitoring data), or your development and QA departments may be interested in looking at it. This goes all the way up to your boss, your boss’ boss, etc. Consensus among individuals in your company will help to enrich your list of requirements. Again, you’re just trying to build consensus at this point. You do not have to commit to any of these requirements; you’re just building an ecosystem view (check out some of the other blog entries for more information on this term) of your company’s performance monitoring needs. Also, you may be able to expand your budget by doing this (how many IT departments have the same budget as the Marketing department?).
  • What information is required to solve the problems? Monitoring is all about data collection! Sure, there is definitely something to be said about how that data is collected (as has been expressed in numerous posts), but at the end of the day you’re left with a set of data. That’s all you’ve got! Yes, you have a control panel or some other tangible artifact, but the only thing that’s going to consistently show you the ROI of performance monitoring is data (alerts, reports, graphs, logs). So it is very important to figure out ahead of time what type of data you are looking for, how you want to display it, and how easy it is to get a presentation of the data that isn’t standard.
  • What is the perspective needed to solve the problems? The accuracy of monitoring is really quite subjective. It depends on what you are willing to consider the end-user perspective. For example, you may want to consider only the HTML load time as the performance of your application because your goal is only to improve the delivery of the HTML (with no consideration given to the various other components that make up the page). Or you may want to consider everything that happens when a person uses your application (JavaScript execution, downloading images, etc.). This matters because you need to understand what you’re buying from a vendor, and you need to understand the context around the data that is collected.
  • How will problem solving techniques be integrated? No monitoring solution will be able to meet all your IT demands. Monitoring is developed to provide accurate and reliable functionality that alerts you to issues and reports on those issues as well as on overall performance; other needs are left to other tools. So it’s imperative to make sure that your monitoring solution can easily plug in to any existing tools (or future tools) your organization has. This may be through technical solutions (APIs, SNMP messages, etc.) or procedural solutions (who gets alerts, how they react to them, decision trees, etc.). Doing this will give credence to a monitoring initiative and will reduce confusion once the solution is implemented.
  • Can future (unexpected) problems be solved? Often a monitoring solution will meet the initial needs but will fail to meet future needs due to the accelerated evolution of technology. As an example, a standard monitoring solution can easily monitor a site that relies on basic HTML but will more than likely have problems with more dynamic technologies like Flash or Ajax.
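
The first question in the list above suggests writing your problems down and giving each one a weight. If paper isn’t your thing, here is a minimal Python sketch of the same idea. The weights assigned to each problem and the `solved_by_monitoring` flag are assumptions for illustration; the names `Problem` and `monitoring_goals` are made up for this sketch.

```python
from dataclasses import dataclass

# The three weight levels suggested above, mapped to a sortable rank.
WEIGHTS = {"critical": 3, "nice to have": 2, "not really necessary": 1}


@dataclass
class Problem:
    description: str
    weight: str                 # one of the WEIGHTS keys
    solved_by_monitoring: bool  # your own judgment call (hypothetical field)


def monitoring_goals(problems):
    """Return the problems monitoring can address, most important first."""
    relevant = [p for p in problems if p.solved_by_monitoring]
    return sorted(relevant, key=lambda p: WEIGHTS[p.weight], reverse=True)


# Example weights are illustrative only; assign your own.
problems = [
    Problem("My site is slow", "critical", True),
    Problem("It becomes unavailable a lot and I want to know when", "critical", True),
    Problem("Customers complain about performance but I don't see it", "nice to have", True),
    Problem("I want to track where my users are going", "nice to have", False),
    Problem("I want a cup of coffee", "not really necessary", False),
]

for p in monitoring_goals(problems):
    print(f"[{p.weight}] {p.description}")
```

What comes out is the short list you would actually take to a monitoring vendor; everything filtered out belongs in a conversation with someone else (an analytics company, a hosting provider, a barista).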

July 21, 2008

Methodologies for Monitoring Performance

Filed under: Monitoring Best Practices, Performance Monitoring — Tags: , — Tyler Fullerton @ 2:08 pm

In the monitoring world there are a number of different methods for verifying the performance of a web application from an end-user perspective. One of the main considerations is whether or not you need to download all content (images, JavaScript, CSS, etc.) that is associated with the page(s). Note that this does not include the content that is part of the initial page(s) requests (Headers, HTML, inline JavaScript and CSS, etc.). My opinion on the matter is that 99% of the time you should download all content that is associated with the page(s) but there are times when you may not want to consider that content as part of performance.

You should consider downloading all content as part of a performance monitoring strategy because the goal is (I assume) to see your performance in the real world, as your customers see it. This can only be done if you consider all content that is delivered to a browser. I personally use the term Browser Fidelity to refer to the degree of accuracy (or real-world representation) that a monitoring solution provides. For example, High Browser Fidelity would be accomplished by a monitoring solution that uses a real-world browser (IE, Firefox, etc.). Of course, the right browser is subjective and depends on how users access your application (if your site was built specifically for Microsoft applications you would want to monitor with IE). A monitor that uses an actual browser will parse the HTML and make additional requests as well as interpret JavaScript (which can result in more requests). Low Browser Fidelity would be achieved by a simple wget (http://en.wikipedia.org/wiki/Wget) or other similar command that just makes a basic request for a web resource but doesn’t parse the results and make additional requests (or execute JavaScript).

As an example, I’ve set up three services in a Webmetrics monitoring account that monitor www.amazon.com. This example lets us see the difference between the High Browser Fidelity and Low Browser Fidelity monitoring solutions. The three services and their degrees of Browser Fidelity are:

  1. AmazonNonFullPage – Low Browser Fidelity. This service does not parse HTML and does not execute JavaScript (or other client code), which results in a very optimistic view of performance that is not accurate, at least as far as an end user is concerned. This service uses an emulated, proprietary browser.
  2. AmazonFullPage – Mid Browser Fidelity. This service does parse HTML and will request items directly referenced in the HTML but will not execute JavaScript or make requests referenced in external files (e.g., CSS). The view of performance is better but still is not completely accurate because JavaScript and other client-side code is not being executed. This service uses the same emulated, proprietary browser as the previous service, but this time it parses the HTML.
  3. AmazonRIA – High Browser Fidelity. This service does parse the HTML, does execute JavaScript (as well as other client-side code and CSS), and does render the page. This is a very accurate level of performance monitoring because of all these actions. This service uses IE 7 browsers for monitoring and therefore tracks the performance of anything that IE 7 does.
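
To make the Mid Browser Fidelity behavior in the list above a bit more concrete, here is a minimal Python sketch of the extra step such a monitor takes: parsing the HTML and collecting the items it references directly. This is only an illustration using the standard library, not how Webmetrics or any other vendor actually implements it. The class name `ResourceCollector`, the sample HTML, and the example.com URLs are all made up for this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of resources referenced directly in the HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            # Images and external scripts are fetched, but scripts are NOT executed.
            self.resources.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            if "stylesheet" in (attrs.get("rel") or "").lower():
                # The stylesheet itself is fetched; files it references are not.
                self.resources.append(urljoin(self.base_url, attrs["href"]))


def referenced_resources(html, base_url):
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return collector.resources


# Hypothetical page, standing in for the response to the initial request.
SAMPLE_HTML = """
<html><head>
  <link rel="stylesheet" href="/css/site.css">
  <script src="js/app.js"></script>
</head><body>
  <img src="http://img.example.com/logo.png">
</body></html>
"""

urls = referenced_resources(SAMPLE_HTML, "http://www.example.com/")
print(f"Low fidelity: 1 request. Mid fidelity: {1 + len(urls)} requests.")
for url in urls:
    print(" ", url)
```

A Low Browser Fidelity check stops after the first request. A High Browser Fidelity check goes further than this sketch can: it runs the JavaScript and renders the page, which is why it needs a real browser.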

So, what are the results? The Low Browser Fidelity (AmazonNonFullPage) service reported an average performance of 1.11 seconds. The Mid Browser Fidelity (AmazonFullPage) service reported an average performance of 5.41 seconds. The High Browser Fidelity (AmazonRIA) service reported an average performance of 6.05 seconds. Each service is monitoring www.amazon.com, and the discrepancy in performance has to do with how much content is being downloaded. The AmazonRIA service is downloading all content and executing JavaScript, whereas the AmazonNonFullPage service isn’t downloading any additional content or executing JavaScript.
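
To put those numbers side by side, here is a quick Python sketch that expresses each average as a share of the High Browser Fidelity time. Treating the AmazonRIA figure as the baseline is my assumption for the comparison; the helper name `share_of_baseline` is made up.

```python
# Average load times reported above, in seconds.
averages = {
    "AmazonNonFullPage (low)": 1.11,
    "AmazonFullPage (mid)": 5.41,
    "AmazonRIA (high)": 6.05,
}

# Assumption: the high-fidelity measurement is the closest to what users see.
baseline = averages["AmazonRIA (high)"]


def share_of_baseline(seconds, baseline_seconds):
    """Percentage of the high-fidelity time that a measurement captures."""
    return 100.0 * seconds / baseline_seconds


for name, seconds in averages.items():
    pct = share_of_baseline(seconds, baseline)
    print(f"{name}: {seconds:.2f}s, {pct:.0f}% of the high-fidelity time")
```

By this rough measure the Low Browser Fidelity service sees only about 18% of the load time that the High Browser Fidelity service reports, while the Mid Browser Fidelity service sees about 89%.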

Low Browser Fidelity should only be considered when price is an issue, when you plan on buying a lot of services (100+), or when your web application is strictly functional and doesn’t have any content (images, etc.). In the majority of cases I would recommend at least the Mid Browser Fidelity solution because it gives you a pretty good ballpark representation of your web application’s performance, and it’s generally cheaper than High Browser Fidelity services. I would recommend the High Browser Fidelity service if your web application uses a lot of JavaScript or other dynamic or client-side technologies, or if you are providing/tracking Service Level Agreements with your clients and third parties.

Oh yeah, performance of this site:

  • Average load time is 1.66 seconds.
  • Availability (uptime) is 100%.

But the 7 day availability is now 99.94% because of a timeout error (page took longer than 3 seconds to download) on Tuesday (7/15) at 11:15pm.
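
An availability figure like this is essentially the share of checks that came back in time. Here is a minimal Python sketch of that calculation, assuming a check counts as failed when it errors out or exceeds the 3 second timeout mentioned above. The exact formula the monitoring service uses isn’t covered here, and the sample values below are invented for illustration, not real measurements of this site.

```python
TIMEOUT_SECONDS = 3.0  # the timeout threshold mentioned above


def availability(load_times, timeout=TIMEOUT_SECONDS):
    """Percentage of samples that completed within the timeout.

    A value of None stands for a check that failed outright.
    """
    if not load_times:
        raise ValueError("no samples to report on")
    ok = sum(1 for t in load_times if t is not None and t <= timeout)
    return 100.0 * ok / len(load_times)


# Illustrative samples only: one slow check and one failed check out of ten.
samples = [1.6, 1.7, 1.5, 3.4, 1.8, None, 1.6, 1.7, 1.6, 1.7]
print(f"{availability(samples):.2f}%")
```

With only ten samples, two bad checks drag availability all the way down to 80%. The more samples you have in the reporting window, the less any single timeout moves the number.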

July 14, 2008

The Importance of Consistent and Frequent Performance Monitoring

Filed under: Monitoring Best Practices, Performance Monitoring — Tags: , — Tyler Fullerton @ 7:36 pm

To have a successful blog you must be consistent not only in your message but also in the frequency of your posts. Furthermore, your frequency has to be granular enough to garner interest. For example, if someone only blogged once a month, or maybe once a week but not consistently, would you continue to read the blog? What if the underlying message wasn’t consistent (one day they’re talking about movies, the next they’re talking about U.S. politics)? Would you still continue to subscribe? This idea about consistency and frequency in blog posts holds true for monitoring as well. In the world of monitoring you need to have consistent behavior and a fairly granular frequency of monitoring.

In the realm of monitoring I use the term “monitoring interval”, which is the frequency with which monitoring is performed (that is, how often you check the performance and availability of a web application). Now, let’s take our two concepts, consistency and frequency, and see why they’re important for monitoring. Before I do that, I want to point out that there are three primary goals that are commonly achieved with monitoring: real-time alerting, trending on historical data, and reporting on SLAs (Service Level Agreements). Back to the concepts; here is why they’re important:

  • Consistency: Using the same monitoring interval (that is, not varying the frequency of the monitoring) will assure you that you will have samples of data for any given time period and that you will not be surprised by any periods of time where no samples or few samples are taken. This is critical for real-time alerting, searching for trends in data, and SLA reporting.
  • Frequency: The frequency is what determines how quickly we can react to issues and it also allows us to have a fine level of granularity to our data.

Therefore the frequency is imperative to real-time alerting as well as SLA reporting (more on this in a bit). Now, let’s look at the different levels of consistency and frequency, with examples. I was going to draw a diagram of this but didn’t have time, so I will try to post that at a later date. Here is a text-based breakdown, starting with no consistency (or frequency) and progressing to high consistency (or frequency):

  1. No consistency/frequency – Just don’t monitor in any way whatsoever. The only consistency is that there is no monitoring being performed. You have to rely on users of your site to inform you of issues (resulting in brand damage and poor user experience), your ability to address issues is reduced almost to zero, and there is no potential to report on SLA or performance.
  2. Low consistency/frequency – When you have some free time (or maybe on a daily basis) you run through the transaction by hand in a browser and, if you’re feeling intrepid, you log the data in an Excel spreadsheet. The impact is that you can’t do real-time alerting, and your ability to trend on data will be minimal because you don’t have enough data to support any meaningful analysis. Rudimentary SLA reporting can be done but would be of little or no value.
  3. Moderate consistency/frequency – Some sort of monitoring tool is used to check the web application in an automated way, but the frequency is set to 1 hour. This is a good start because you are now performing monitoring on an automatic basis; however, your granularity is too coarse to allow real-time alerting. You can start to get an idea of trending the data (though it will probably frustrate you), and the SLA reporting that you could do will be just enough to cause your users to laugh and not take you seriously.
  4. High consistency/frequency – Monitoring is performed from a tool that automates the process (similar to the previous point); however, the monitoring interval (frequency) has a finer granularity. For example, the interval is usually 5 minutes (but could be shorter). You have consistent and reliable monitoring, and your frequency is just right: you can be alerted to issues with very little latency (within ~5 minutes), your data set is complete enough to allow you to make decisions based on previous trends in performance, and you can provide SLAs that will have meaning to your end users.
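
The levels above come down to simple arithmetic on the monitoring interval. Here is a minimal Python sketch of it, assuming that the worst-case detection delay is roughly one interval (ignoring how long the check itself and alert delivery take) and that SLA reporting covers a 7-day window. The function names are made up for this sketch.

```python
MINUTES_PER_DAY = 24 * 60


def samples_per_day(interval_minutes):
    """How many checks a fixed monitoring interval produces per day."""
    return MINUTES_PER_DAY // interval_minutes


def availability_step(interval_minutes, days=7):
    """Availability lost (in percentage points) by one failed sample."""
    total = samples_per_day(interval_minutes) * days
    return 100.0 / total


levels = [
    ("daily manual check", 1440),
    ("hourly", 60),
    ("every 5 minutes", 5),
]

for label, interval in levels:
    print(
        f"{label}: {samples_per_day(interval)} samples/day, "
        f"worst-case detection delay ~{interval} min, "
        f"one failure costs {availability_step(interval):.2f} points over 7 days"
    )
```

At a 5 minute interval you get 288 samples a day, so a single failed check moves a 7-day availability figure by about 0.05 points. At a 1 hour interval the same failure costs about 0.60 points, and you may not hear about it for the better part of an hour.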

When implementing a monitoring solution for a web application it is imperative that you consider the consistency of the monitoring solution as well as the frequency at which you will monitor. Without these considerations, the data you’re collecting may not be of any relevance or use to you. Oh, yeah, one other thing. I’m still monitoring this site. I’d like to see how the performance changes over time, so I’m going to continue to post the performance at the end of each blog entry.

  • Average load time is 1.69 seconds.
  • Availability (uptime) is 100%.
