Phil Dixon recently gave a great talk at Velocity 2009 about the cost of poor performance. Through a series of rigorous tests, measurements, and reports, Shopzilla was able to quantify the cost of performance to the bottom line of the business. Performance monitoring is often thought of as insurance: you pay a small premium every month so that you are notified when the site is down. But it is increasingly becoming much more than this. With the variety of tools and APIs now available on multiple platforms, it’s possible (though it does require a good amount of effort) to tie performance and profitability directly together. Another talk at Velocity argued that the future of performance monitoring is really about translating performance and scalability costs into their direct impact on the business. In this model, the idea is to create a curve that plots the costs of scalability and performance (hardware, engineering, etc.) against the impact to the bottom line (revenue, direct sales, etc.). Alistair Croll and Sean Power’s book covers these topics in much more depth.
July 23, 2009
November 12, 2008
How to define a monitoring transaction
The first step in monitoring the performance of an online asset is to determine what really needs to be monitored. This is usually pretty easy if you’re just looking to monitor the availability and performance of a website. In this case you just need to monitor a single URL (or a subset of the most important URLs) and not every URL on the site (since all web pages are most likely hosted on the same infrastructure). Focusing your attention on the essentials helps reduce costs. For example, if it costs $10 to monitor a single URL and you have 1,000 pages on your site, that’s $10,000 you have to cough up to monitor every page. But the return on each page over a certain threshold is less and less (i.e. the core value you’re getting from the monitoring is knowing that the site is up and running, and if 10 pages are not available it’s more than likely that all pages are not available). What about the case where you want to monitor a complex web application?
It requires a similar concern because the more you monitor, the more it costs. However, in this case there is a bit more up-front analysis to do in order to determine the scope of what is to be monitored. To monitor a web-based application, a script is required to tell the monitoring platform what steps to perform and how to interact with the application. The steps that make up that script constitute a Transaction and directly impact the cost of monitoring (most monitoring solutions use a metered billing approach that charges a single credit per step of a transaction). So there is a direct correlation between how much a monitoring solution costs and the scope of that monitoring (as the number of steps goes up, so does the price). It’s important to note that even if metered billing isn’t used, most monitoring platforms increase the price of the solution as the number of steps increases (ex: $100 for 1 to 3 steps, $200 for 4 to 6 steps, etc.). It’s just a fact of life: those cost increases sometimes recoup the processing power spent performing the steps, but mostly they offset the costs of storing, backing up, and reporting on data. So, how does one save money? One defines a transaction as only the steps necessary to test functionality. Often the following mistakes are made:
- Monitoring the mundane (or for the wrong reasons).
- Monitor duplicate functionality (too much).
- Monitor too little (taking these recommendations too far).
Let’s look at each of these situations and see how they impact costs as well as how they affect the bottom line (collecting actionable monitoring data). In each case we will consider only monitoring of transactions:
Monitoring the mundane – This is generally the product of an organization that hasn’t thoroughly thought out the goals of monitoring. A transaction that I would consider mundane is one that doesn’t really have an end goal and just meanders around the website. For example, a transaction that clicks through each menu item in the left nav bar is probably mundane. Sure, there’s an argument for doing that: maybe lots of revenue is generated from the left nav bar, or maybe that’s the only navigation available for the site. But in actuality this is really a QA problem and should be addressed as such. It’s commonly held in software engineering that the later a problem is discovered, the more it costs to fix. That is definitely the case here: a QA process that occurs right after development could have caught any broken links or JavaScript funkiness more efficiently than a costly monitoring solution after the code has been deployed.
Monitor duplicate functionality – Sometimes this is a hard one to get around. But basically you need to make sure that your monitoring transactions are mutually exclusive. Don’t monitor the updating of a web based calendar in two separate transactions when one will do. Another case is when similar methods are invoked in a single transaction. For example, if you have a tool that configures a product and does so in 20 steps it’s probably overkill to perform all configurations (since they all probably access the same front-end and back-end functionality). Have the transaction perform a couple of configurations and then complete the transaction (i.e. purchase, or whatever the result of the transaction is). In this last case the duplicate functionality is a bit obscure…on the front-end the functionality looks different (configure a tire vs configure a stereo) but on the back-end the functionality is more than likely the same (accessing the same database through the same web service) and therefore all you’re really doing is testing more client side code execution (which probably should have been done during the QA process again).
Monitor too little – If you get too carried away with the recommendations I’ve made you could end up shooting yourself in the foot. For example, in the last section I gave the example of an application that configures a product and uses the same back-end technology for each step of the configuration. But it very well could be the case that third party functionality is embedded in the configuration tool (one step could be hosted by you while another step makes a request to a third party). In that case maybe it does make more sense to monitor additional steps (though it could probably be done more efficiently by breaking the monitoring of that third party content out into its own monitoring service). The end result is that you’re looking for efficiencies in monitoring that will help reduce cost while NOT altering the data set you expect to get from the monitoring.
To summarize, you want to focus your monitoring so that you can achieve your goals in improving performance without creating convoluted and expensive data sets. At the same time, be careful not to get too zealous with efficiency and strip your data set of all its value.
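The cost arguments in this post can be sketched in a few lines of Python. This is only an illustration: the $10 per URL and the $100 per block of three steps are the example figures used above, not any real vendor’s price list, and the function names are made up for this sketch.

```python
# Hypothetical pricing sketch. The rates are the example figures from this
# post, not a real vendor's price list.

def flat_url_cost(url_count, price_per_url=10):
    """Cost of monitoring each URL separately at a flat per-URL price."""
    return url_count * price_per_url


def tiered_cost(step_count, tier_size=3, price_per_tier=100):
    """Price of one transaction under tiered billing.

    With the defaults: $100 for 1 to 3 steps, $200 for 4 to 6 steps, etc.
    """
    if step_count < 1:
        raise ValueError("a transaction needs at least one step")
    tiers = -(-step_count // tier_size)  # ceiling division
    return tiers * price_per_tier


if __name__ == "__main__":
    print(flat_url_cost(1000))  # every page of a 1,000-page site: 10000
    print(flat_url_cost(10))    # only the ten most important URLs: 100
    print(tiered_cost(20))      # all 20 configuration steps: 700
    print(tiered_cost(5))       # a trimmed five-step transaction: 200
```

Under these assumed rates, trimming a 20-step configuration transaction down to five representative steps drops the price from $700 to $200. Whether the trimmed transaction still exercises everything you care about is the judgment call described under "Monitor too little".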
November 5, 2008
Blog Performance
I’ve been receiving a lot of alerts from my monitoring system for this blog so I thought I’d do a bit of investigation. I checked my monitoring data to see if there was a trend in performance degradation and sure enough I found it in a graph that lays out the page load time and errors. Here’s what it looks like:
It doesn’t show any major overall increase in page load time (that is, a left-to-right increase in the value of the dark green line) but I do notice that errors have started to increase, culminating in 31 errors on Tuesday (11/4). Though I don’t have any analytics data to prove this, I’m pretty sure it had to do with all the blogging (and blog reading) that was going on nationwide with the presidential election. All that traffic on WordPress certainly would have affected my blog’s performance (and that of every other blog on the same hardware). It’s not just WordPress: other sites carrying news about the election were hammered with users looking for election results (here’s an article on Akamai’s traffic for yesterday).
When I drilled down into the monitoring results I found that the errors I was experiencing were timeout errors. In my monitoring software I’ve established a performance threshold of 3 seconds. That means that my page and all its content must be completely downloaded by the browser (IE7) within 3 seconds. When I established this threshold I had been monitoring the performance of the page for a week and saw that the average performance was around 1.8 seconds. Two factors are making me reconsider that threshold, and these are two factors that everyone providing a web service should think about and adapt to:
- Content (particularly in the world of blogging) will be added to a web page/application over time. For example, this blog has slightly more than doubled in size (when I first started blogging the page download was 177951 bytes, now it’s 366514 bytes). This is going to affect the performance of the resource (another example: adding Ajax to a website will require more JavaScript that will need to be downloaded). Future growth needs to be considered when setting expectations for performance.
- Outside factors will always influence your performance; that’s why external monitoring is a must for web resources. I should have seen it coming a mile away. This is a classic example of current events or major news events drawing hordes of web users to servers that can barely contain them (ok, ok…too dramatic, in all reality it doesn’t look like the WordPress servers were in any danger of failing). Ultimately it brings up a very important issue: know which shared resources in your application can make you vulnerable to other tenants’ traffic.
These considerations will make your SLA (or any agreement you may have with users of your application) stronger and more robust. It is definitely not recommended to continually adjust your SLA. The SLA is a pact with your users and it’s best to have it remain a strong/unchangeable core. Changing your SLA every week to keep up with performance degradations is counterproductive, and at that point it doesn’t really make sense to even provide an SLA. So I have two options. First, I could change my SLA/Timeout value from 3 seconds to 4 or 5 seconds; this would definitely reduce the number of alerts I’m receiving (it would also weaken my stance on what I promise to my readers). Second, I can keep my Timeout value at 3 seconds and see what happens with the performance of the site (and even proactively improve the performance of the site). At this point I’m not too concerned about the alerts and I think that as the election hype dies down I will see a reduction in alerting (which has been the case over the last 24 hours). It may not even be necessary at this point to make any improvements to the site, though it’s certainly something that will need to be considered in the future as I continue to post content.
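To make the threshold reasoning concrete, here is a minimal Python sketch. The 3 second threshold, the 1.8 second average, and the two page sizes are the figures quoted in this post; the list of load times and the function names are invented purely for illustration.

```python
from statistics import mean


def summarize(load_times, threshold=3.0):
    """Average load time and count of samples that exceeded the threshold.

    Assumes load_times is a non-empty list of seconds.
    """
    over = [t for t in load_times if t > threshold]
    return {
        "average": mean(load_times),
        "errors": len(over),
        "error_rate": len(over) / len(load_times),
    }


def growth_factor(old_bytes, new_bytes):
    """How many times larger the page download has become."""
    return new_bytes / old_bytes


def headroom(average, threshold):
    """Fraction of the threshold not yet used up by the average load time."""
    return (threshold - average) / threshold


if __name__ == "__main__":
    samples = [1.7, 1.9, 2.4, 3.2, 4.1, 1.8]  # made-up load times in seconds
    print(summarize(samples))
    print(growth_factor(177951, 366514))  # page sizes quoted in this post
    print(headroom(1.8, 3.0))             # original average vs. threshold
```

With a 1.8 second average against a 3 second threshold there was about 40% headroom when the threshold was set. A page that has roughly doubled in size eats into that margin even before outside traffic is taken into account.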
September 30, 2008
The Newbie Introduction (A Recap)
I’ve been blogging on performance monitoring for a couple months now and it dawned on me that the information presented is probably straightforward for someone who has had past experience in performance monitoring (setting up, interacting with, etc.), but for someone new to performance monitoring it may be hard to cobble together a decent understanding of performance monitoring from a bunch of scattered concepts posted on a blog. My goal with this post is to provide a list of questions that are integral to establishing a functional performance monitoring solution. These questions are:
- What problems are being solved?
- What base are we trying to solve problems for?
- Who is involved in solving the problems?
- What information is required to solve the problems?
- What is the perspective needed to solve the problems?
- How will problem solving techniques be integrated?
- Can future (unexpected) problems be solved?
These questions revolve around the idea of solving a problem. That is, you have a web based resource (website, application, web service, etc.) and something about it is keeping you up at night. So, let’s go through these one at a time:
- What problems are being solved? This is the most important question you can ask yourself, but you already knew that. You need to know where you’re at before you can determine if you’re moving in the right direction. The process of implementing performance monitoring should be broken down into distinct milestones that should be in place before you even start talking to vendors. For example, suppose your answer to this question is: “I want to know where on my site my users are going, where they enter the site, and where they leave the site!” Talking to a performance monitoring vendor about this is only going to infuriate you as they try to sell you monitoring when all you really need is analytics (by the way, there are lots of really well established companies that can help you with analytics, such as Omniture or Coremetrics). So know the problems that you want to solve…these are your goals. Write them down on a piece of paper and give each one a weight (critical, nice to have, not really necessary). Since the assumption is that you don’t know anything about monitoring, you’ll have to be vague. Problems like “My site is slow”, “It becomes unavailable a lot and I want to know when that is”, “My customers complain about performance but I don’t see it”, “I want to track where my users are going”, “I want to be able to perform failover if something happens”, “I want to have someone else host my site/content”, and “I want a cup of coffee!” are what you need to concentrate on. They help you know where you need to go and how to get there. The first three problems can be solved with monitoring. The fourth, fifth, sixth, and seventh are probably not problems that you would want to solve with monitoring. So now you’re armed with information that will allow you to successfully navigate the process of picking a vendor.
- What base are we trying to solve problems for? That is, what exactly do you want to monitor? In an ideal world you could monitor and collect performance metrics on every page of your site, or every business process in your web application. The problem is that the cost and management of that solution is prohibitive and the data generated (alerts, logs, and reports) is probably more than any organization would be able to handle. So it’s clear that you need to distinguish between what you need to monitor and what you don’t need to monitor. This really depends on your business model and commitments to your customers (and other stakeholders). If one of the main problems you’re trying to solve (from the previous question) is managing SLA values, then you need to consider monitoring only those resources that fall under the SLA. If many resources fall under the SLA then you could potentially re-tool your SLA so that SLA verification is based on a subset of the services you provide (this depends on how well established your SLA already is and whether your clients will allow you to amend your agreement with them). One important attitude to have during this step is honesty with yourself. You really need to be honest and make compromises with yourself (and your organization) as to what needs to be monitored, what would be nice to monitor (but not necessary), and finally what doesn’t need to be monitored. You may even want to reach out beyond your current needs and ask others (customers, executives, etc.) what they might need performance metrics on. Their requests may not fit into your current budget or plans but it’s always good to know what the future holds.
- Who is involved in solving the problems? Gather the people that are going to help you make a decision. If you’re the head of an IT department then you’re going to want to poll your customer base (whether this is an internal or external base) and find out what they’re asking for. Are they even concerned with performance? If they are, how apparent are performance issues to them? Also, find others within your organization that can help you evaluate the performance monitoring services/tools that you will be looking at. Marketing, for example, may be able to make great use of the data that is collected (by the way, marketing departments are almost always beneficiaries of performance monitoring data), or your development and QA departments may be interested in looking at the data that is collected. This goes all the way up to your boss, your boss’ boss, etc. Having consensus among individuals in your company will help enrich your list of requirements. Again, you’re just trying to build consensus at this point. You do not have to commit to any of these requirements; you’re just building an eco-system (check out some of the other blog entries for more information on this term) view of your company’s performance monitoring needs. Also, you may be able to expand your budget by doing this (how many IT departments have the same budget as the Marketing department?).
- What information is required to solve the problems? Monitoring is all about data collection! Sure, there is definitely something to be said about how that data is collected (as has been expressed in numerous posts), but at the end of the day you’re left with a set of data. That’s all you’ve got! Yes, you have a control panel or some other tangible artifact, but the only thing that’s going to consistently show you the ROI of performance monitoring is data (alerts, reports, graphs, logs). So it is very important to figure out ahead of time what type of data you are looking for, how you want to display it, and how easy it is to get a presentation of the data that isn’t standard.
- What is the perspective needed to solve the problems? The accuracy of monitoring is really quite subjective. It depends on what you are willing to consider the end-user perspective. For example, you may want to consider only the HTML load time as the performance of your application because your goal is only to improve the delivery of the HTML (and no consideration is to be given to the various other components that make up the page). Or you may want to consider everything that happens when a person uses your application (JavaScript execution, downloading images, etc.). This matters because you need to understand what you’re buying from a vendor, and you will also need to understand the context around the data that is collected.
- How will problem solving techniques be integrated? No monitoring solution will be able to meet all your IT demands. Monitoring is developed to alert you to issues and to report on those issues as well as on overall performance, and little beyond that. So it’s imperative to make sure that your monitoring solution can easily plug in to any existing tools (or future tools) your organization has. This may be through technical solutions (API, SNMP messages, etc.) or procedural solutions (who gets alerts, how they react to them, decision trees, etc.). Doing this will give credence to a monitoring initiative and will reduce confusion once the solution is implemented.
- Can future (unexpected) problems be solved? Often a monitoring solution will meet the initial needs but will fail to meet future needs due to the accelerated evolution of technology. As an example, a standard monitoring solution can easily monitor a site that relies on basic HTML but will more than likely have problems with more dynamic technologies like Flash or Ajax.
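The perspective question in the list above can be made concrete with a small Python sketch. It compares an "HTML only" measurement with a "full page" measurement over the same page load. The waterfall data, resource names, and function names are all hypothetical and exist only for illustration.

```python
# Each entry is (resource, kind, start_seconds, end_seconds), measured from
# the start of the page request. All values are made up for illustration.
SAMPLE_WATERFALL = [
    ("/index.html", "html", 0.00, 0.42),
    ("/style.css", "css", 0.45, 0.70),
    ("/app.js", "script", 0.45, 1.10),
    ("/header.png", "image", 0.72, 1.65),
]


def html_only_time(waterfall):
    """Time until the HTML document itself has finished downloading."""
    return max(end for _, kind, _, end in waterfall if kind == "html")


def full_page_time(waterfall):
    """Time until the last component of the page has finished downloading."""
    return max(end for _, _, _, end in waterfall)


if __name__ == "__main__":
    print(html_only_time(SAMPLE_WATERFALL))
    print(full_page_time(SAMPLE_WATERFALL))
```

Both numbers describe the same visit, yet one is roughly four times the other. Which of them a vendor reports as "page load time" is exactly the context you need to pin down before comparing data or signing a contract.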
September 9, 2008
Active and Passive Monitoring Solutions
Note: My brain must be on the fritz. Dear readers, I have updated the post as I completely misused the terminology and definitions of Active and Passive monitoring. I apologize for any inconvenience and have updated this post as of September 22nd (3pm PST).
I’ve been asked quite a few times about the distinctions between active and passive monitoring and which is the best method to consider when implementing a monitoring methodology. In this post I’d like to provide a basic introduction to the two types of monitoring and talk briefly about their benefits and deficiencies.
First, let’s start with definitions of these terms:
- Passive Monitoring – Performance/Availability monitoring that uses data sets generated from actual human users of a website or web application.
- Active Monitoring – Performance/Availability monitoring that uses data sets that are generated by a consistent and automated user of a website or web application.
We can see from these definitions that in one case (Passive Monitoring) we are relying on the real world experiences of the existing user base for the website/application, similar in fashion to how web analytic data is collected. In the other case (Active Monitoring) we are relying on the experience of a synthetic user (a piece of software that emulates an end-user’s interaction with a website/application). Let’s start our analysis of the two methodologies by looking at their similar properties:
- Both can provide the same statistics (uptime, availability, errors, throughput, and other performance metrics). Essentially, neither is limited to the basic data sets of performance monitoring solutions.
- Both will reflect accurate measurements that will represent the performance of the server at the time the sample was taken. Stated differently, if the infrastructure for the website/application is under duress then the impact will be reflected in the data that is collected by the monitoring solution. There are fringe cases where this concept breaks down for Passive Monitoring that we will discuss below.
What about the differences in these monitoring methodologies? Here are the basic properties of an Active Monitoring solution:
- Monitoring is performed from an emulated user. This can be as simple as an automated process that makes base level HTTP requests (ex: Unix wget commands) or can be a complex solution using actual browsers for performing monitoring. In either case, we are talking about a user of the website/application that is strictly software based.
- Monitoring is consistent throughout the day and will always attempt to monitor regardless of the state of the website/application infrastructure.
- Monitoring is consistent in configuration of the monitoring environment. That is, every time you monitor the user (automated process) is the same.
And for Passive Monitoring solutions, the properties are:
- Monitoring is performed by actual (human) users. This is done by executing JavaScript code embedded in the website/application that tracks the performance the end user sees while accessing the site.
- Monitoring reflects the actual usage parameters of the end users (ex: browser type, configuration, platform, etc.). This is another way of saying that the end user perspective is accurately represented.
- Monitoring will adapt to the demographic of the users of the website/application.
The end goal of monitoring is really going to be the driving force in dictating which solution is the best. Some companies may want to be able to record the experiences of their actual users; in this case Passive Monitoring is the appropriate solution. Passive Monitoring will allow the company to collect samples on performance that are actionable in the sense that they show what types of browsers are being used, which platforms are most important, which problems certain users are having, and how performance looks from a specific location. This can help direct the company’s efforts when it comes to initial development and improvements of a website/application. The Active Monitoring solution is more in tune with the task of reporting on performance to management, ensuring availability, and tracking SLAs because it is more consistent and has a reliable monitoring base that will not change over time.
Each solution has its faults as well. For Active Monitoring the faults are:
- Does not track experiences of actual users accessing the site.
- Does not provide statistics on browsers and platforms used by website/application end users.
- Does not provide last mile information.
Passive Monitoring has the following faults:
- Monitoring requires JavaScript, which can alter the performance of the website/application being monitored and can potentially break or not work altogether (if someone has JavaScript turned off).
- Monitoring is subject to spoofing since information about a browser, platform, and other environment variables can be altered by a malicious end user.
- Monitoring data will be sporadic and will only be collected when users are accessing the website/application. No data will be collected during times when users are not on the site. Therefore…
- Issues with the website/application will not be detected until someone accesses the site. The ability to detect problems before customers do is central to an ongoing monitoring solution, and Passive Monitoring cannot provide it.
- If the site becomes unavailable then no monitoring will be performed because users will be unable to interact with the website/application and therefore will not be able to execute the JavaScript that will track their experience.
The final analysis is: Passive Monitoring is great for QA and development purposes. If your product is in Beta or not mature/critical enough for an SLA then this may be the best solution because it provides statistics on how your end users experience your website/application and the specifics of their environments. However, if your application is central to your business or a certain level of service (performance and availability) is expected (even agreed to) by your customers then you need a more consistent and robust monitoring solution. The Active Monitoring solution is far superior for these types of environments because it guarantees that monitoring takes place, keeps that monitoring consistent (you don’t have to distinguish samples based on environmental factors such as IE vs. Firefox), alerts you to problems early (before your customers see those problems), and provides a basis for reporting performance and availability to others in the organization as well as tracking SLAs.
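As a rough illustration of the Active Monitoring idea described in this post, here is a minimal synthetic probe in Python. It is a sketch under stated assumptions, not how any particular monitoring product works: the function names, the 3 second default, and the shape of the returned sample are all invented for this example, and only the standard library’s `urllib.request.urlopen` is used for the real request.

```python
import time
import urllib.request


def fetch_status(url, timeout):
    """Default fetcher: issue an HTTP GET and return the status code."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, timeout=3.0, fetch=fetch_status, clock=time.monotonic):
    """Run one synthetic check and return a sample.

    A sample is produced even when the request fails. That is the property
    that lets an automated user report downtime no human visitor has seen.
    """
    started = clock()
    try:
        status = fetch(url, timeout)
        ok = 200 <= status < 400
        error = None if ok else f"HTTP {status}"
    except Exception as exc:  # timeouts, DNS failures, refused connections
        ok, error = False, str(exc) or type(exc).__name__
    elapsed = clock() - started
    if ok and elapsed > timeout:
        # The request succeeded but took longer than the threshold allows.
        ok, error = False, "over threshold"
    return {"url": url, "ok": ok, "seconds": elapsed, "error": error}
```

Calling `probe()` for the same URL at a fixed interval (from cron or any scheduler) gives the consistent, always-attempted sampling listed among the Active Monitoring properties. The `fetch` parameter exists only so the probe can be exercised without a network, for example by passing a function that always raises a timeout.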
August 22, 2008
SOA Management Framework
I’ve been reading a book on SOA architectures by Nicolai M. Josuttis which provides a very accessible introduction to SOA (Service Oriented Architecture) design, benefits, and established best practices. One theme that keeps coming up is Collaboration, and in fact Nicolai states:
One key requirement for SOA is collaboration (pg 104).
The collaboration that Nicolai is talking about in the book is among isolated departments or business units within a company and is a key factor in ensuring the success of SOA (pg 104). This need to collaborate is a major driving force behind the decisions that need to be made to manage a SOA application.
Take the example of any company that has successfully implemented a SOA based application. The company has overcome the inflexibility of a large number of complex, distributed systems by creating a framework of services and processes that expose functionality to the consumers (users) of the application. This leaves the company with a number of services and processes that interact in a choreographed manner where no process has total control and all processes and services have limited knowledge of the overall application. The departments behind them have given up some knowledge (and control) to be a part of a more flexible federated application (think about U.S. history: at the Constitutional Convention of 1787 the states were basically asked to give up some of their influence and power to a centralized federal government. Some influence is gone but the resulting government is stronger and more flexible). Now there is a flexible (and scalable) infrastructure but there isn’t any unifying view of the application (other than the application itself, but if you look only at the application you become ignorant of the components underneath).
In a SOA environment, monitoring of these services and processes is going to become more and more critical because of the limited scope of knowledge each department has. I think SOA based applications are still a relatively new concept that companies are experimenting with, so there really hasn’t been much consideration as to how to ensure the performance of these applications and distribute the results to all interested/involved parties. What I suspect is that a need will arise (if it hasn’t already) for a platform in which all functional and non-functional requirements of a SOA based environment can be managed. I’ve heard that this isn’t even possible because of all the different ideas and methodologies for implementing SOA, but it seems clear that some base framework, one that doesn’t contribute to the underlying architecture, needs to be present to manage the architecture’s requirements. What I’d really like to see is a platform that allows you to plug in non-functional requirements (e.g. performance monitoring, SLA management, Business Process Management, etc.) as needed. A SOA management platform would help alleviate the pains that can occur when a company’s culture acts to resist collaboration. My experience with performance monitoring tells me that unless such a platform exists, there will never be widespread adoption of monitoring for SOA based applications.
