The PerfMon Blog

July 29, 2008

MobileMe or: How I Learned to Stop Worrying and Love the Load Test

Filed under: Load Testing, Performance Monitoring - Tyler Fullerton @ 2:24 pm

So, you’ve got your new iPhone 3G. You waited in line at the Apple or AT&T store for hours, and it was worth it. You probably bought the 8 Gig, but maybe you bought the 16 Gig (and, like a co-worker of mine, you bought the white version so people knew you had the 16 Gig iPhone). You’ve now got the cool new iPint application, or you’ve finally solved an age-old problem with the BigTipper application. Furthermore, you’re extremely happy that you can use MobileMe to store your contacts, calendar appointments, tasks, etc. on “The Cloud” so that you can easily keep your iPhone and MacBook synchronized. Life is good, Windows is bad, and Apple rules... but, er, wait. What’s this? Why can’t I manage my contacts, and why are the changes I made not propagated to my devices? And where are my emails?

I don’t care how good Apple products are. “The Cloud” is bad (not that it means to be). With all the hype the new iPhone 3G is getting, some unexpected issues arose on the Apple-run me.com (MobileMe) site when it was flooded with traffic (i.e. users). These issues caused downtime, loss of functionality, and even loss of mail messages for some MobileMe users.

MobileMe is a web application that allows you to keep all your information on the web so that all your devices can be synced from one convenient location that, conveniently, isn’t under your control. I don’t know exactly what Apple did to prepare for the launch of this application, but it would seem that load testing was not performed (or not properly performed), since there appears to have been no concept of how much load the servers could handle.

A load test would have helped the IT folks at Apple understand what capacity they were currently operating at, where their likely points of failure were (and at what levels of usage those points would fail), and what infrastructure they would need to surpass those limits and prevent failure. Load testing is something that should be done before any major release, ideally by someone with a solid background in running load tests and reviewing the results (either internal or external to the company, as long as they have some experience with load testing).

Generally, load tests are run before a major release of a web application or service. A load test consists of a number of iterations in which virtual users (the load) are applied to the server for a duration of time (usually 1 hour). During each iteration, multiple types of scenarios are run. In the case of the MobileMe application, the scenarios could have been:

  1. Login to MobileMe.
  2. Sync contacts on MobileMe.
  3. Update calendar event.

The first iteration of a load test would have consisted of thousands of virtual users (each one performing one of the above scenarios) accessing the MobileMe application. During the test, the virtual users would track performance metrics (availability and load time) for the various scenarios mentioned above. For example, as the load on the server increases, we could see how performance degrades for someone trying to log in to the MobileMe application while thousands of other requests are being served by the MobileMe server. Once this initial iteration is complete, the Apple team would make revisions to their application and architecture based on the results of the load test. After those changes are made, another load test iteration could be conducted to ensure that the changes have improved performance and not caused any unintended issues. Rinse and repeat.
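
To make the idea of an iteration a bit more concrete, here is a minimal Python sketch of one. It is only an illustration under stated assumptions: the function names (run_virtual_user, run_iteration, summarize) and the fake transactions are hypothetical, nothing here reflects what Apple or any load testing product actually does, and a real test would replace the fake transactions with scripted requests against the application under test.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_virtual_user(scenario_name, transaction):
    """Run one transaction; record its load time and whether it succeeded."""
    start = time.perf_counter()
    try:
        transaction()
        ok = True
    except Exception:
        ok = False
    return {
        "scenario": scenario_name,
        "ok": ok,
        "load_time": time.perf_counter() - start,
    }


def run_iteration(scenarios, virtual_users):
    """Apply `virtual_users` concurrent users, assigned round-robin to scenarios."""
    names = list(scenarios)
    with ThreadPoolExecutor(max_workers=virtual_users) as pool:
        futures = []
        for i in range(virtual_users):
            name = names[i % len(names)]
            futures.append(pool.submit(run_virtual_user, name, scenarios[name]))
        return [future.result() for future in futures]


def summarize(results):
    """Return availability (%) and average load time (seconds) per scenario."""
    summary = {}
    for name in sorted({r["scenario"] for r in results}):
        rows = [r for r in results if r["scenario"] == name]
        summary[name] = {
            "availability": 100.0 * sum(r["ok"] for r in rows) / len(rows),
            "avg_load_time": mean(r["load_time"] for r in rows),
        }
    return summary


# Hypothetical stand-ins for real scripted transactions.
def fake_ok():
    time.sleep(0.01)  # pretend the server needed about 10 ms


def fake_failing():
    raise ConnectionError("simulated outage")


if __name__ == "__main__":
    scenarios = {
        "login": fake_ok,
        "sync_contacts": fake_ok,
        "update_calendar_event": fake_failing,
    }
    # Ramp the load up step by step and report the metrics at each level.
    for users in (30, 60, 90):
        print(users, summarize(run_iteration(scenarios, users)))
```

The point of the sketch is the shape of the loop: apply a known load, record availability and load time per scenario, then change something and run it again.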

Here’s a link to the MobileMe status blog (http://www.apple.com/mobileme/status/), which outlines more specifics about the outage. This is a problem that all companies providing web-based applications and services need to consider: it’s great when you launch an application that’s popular, but it’s not so great when that popularity damages your brand and prevents your service from being usable. The iPhone is popular enough that most users will probably ignore the outage (and the lost emails between July 16th and 18th). Heck, those users may even endure a couple more outages. But the majority of applications on the web do not have the brand backing of Apple/iPhone.

More on load testing offered by Webmetrics: http://www.webmetrics.com/loadtesting.html.

And here’s the performance for this blog:

  • Average load time is 1.70 seconds.
  • Availability (uptime) is 100%.

July 21, 2008

Methodologies for Monitoring Performance

Filed under: Monitoring Best Practices, Performance Monitoring - Tyler Fullerton @ 2:08 pm

In the monitoring world there are a number of different methods for verifying the performance of a web application from an end-user perspective. One of the main considerations is whether or not you need to download all content (images, JavaScript, CSS, etc.) that is associated with the page(s). Note that this does not include the content that is part of the initial page(s) requests (Headers, HTML, inline JavaScript and CSS, etc.). My opinion on the matter is that 99% of the time you should download all content that is associated with the page(s) but there are times when you may not want to consider that content as part of performance.

You should consider downloading all content as part of a performance monitoring strategy because the goal is (I assume) to see your performance in the real world, as your customers see it. This can only be done if you consider all content that is delivered to a browser. I personally use the term Browser Fidelity to refer to the degree of accuracy (or real-world representation) that a monitoring solution provides. For example, High Browser Fidelity would be accomplished by a monitoring solution that uses a real-world browser (IE, Firefox, etc.). Of course, the choice of browser is subjective and depends on how users access your application (if your site was built specifically for Microsoft applications, you would want to monitor with IE). A monitoring solution that uses an actual browser will parse the HTML and make additional requests, as well as interpret JavaScript (which can result in more requests). Low Browser Fidelity would be achieved by a simple wget (http://en.wikipedia.org/wiki/Wget) or similar command that just makes a basic request for a web resource but doesn’t parse the results, make additional requests, or execute JavaScript.

As an example, I’ve set up 3 services in a Webmetrics monitoring account that are monitoring http://www.amazon.com. This example will allow us to see the difference between the High Browser Fidelity and Low Browser Fidelity monitoring solutions. The three services and their degrees of Browser Fidelity are:

  1. AmazonNonFullPage – Low Browser Fidelity. This service does not parse HTML and does not execute JavaScript (or other client code), which results in a very optimistic view of performance that is not accurate, at least as far as an end user is concerned. This service uses an emulated, proprietary browser.
  2. AmazonFullPage – Mid Browser Fidelity. This service does parse HTML and will request items directly referenced in the HTML, but it will not execute JavaScript or make requests referenced in external files (ex: CSS). The view of performance is better but still not completely accurate, because JavaScript and other client-side code is not being executed. This service uses the same emulated, proprietary browser as the previous service, but this time it parses the HTML.
  3. AmazonRIA – High Browser Fidelity. This service does parse the HTML, does execute JavaScript (as well as other client-side code and CSS), and does render the page. Because of all these actions, this is a very accurate level of performance monitoring. This service uses IE 7 browsers for monitoring and therefore will track the performance of anything that IE 7 does.

So, what are the results? The Low Browser Fidelity (AmazonNonFullPage) service reported an average load time of 1.11 seconds. The Mid Browser Fidelity (AmazonFullPage) service reported an average of 5.41 seconds. The High Browser Fidelity (AmazonRIA) service reported an average of 6.05 seconds. Each service is monitoring http://www.amazon.com, and the discrepancy in performance comes down to how much content is being downloaded. The AmazonRIA service downloads all content and executes JavaScript, whereas the AmazonNonFullPage service doesn’t download any additional content or execute JavaScript.
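
To illustrate where the extra requests come from, here is a small Python sketch of the difference between the Low and Mid levels. It is an assumption-laden illustration, not how the Webmetrics services are implemented: the ResourceCollector class, the sample HTML, and the example.com URLs are all made up. A Low Browser Fidelity check stops after the first request, while a Mid Browser Fidelity check also parses the HTML and requests whatever is referenced directly in it.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ResourceCollector(HTMLParser):
    """Collect URLs of images, scripts and stylesheets referenced in the HTML."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script") and attributes.get("src"):
            self.resources.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            # Only stylesheets count as page content here.
            if "stylesheet" in (attributes.get("rel") or "").lower():
                self.resources.append(urljoin(self.base_url, attributes["href"]))


def request_counts(html, base_url):
    """Number of requests a Low vs. Mid Browser Fidelity check would make."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return {"low": 1, "mid": 1 + len(collector.resources)}


# Made-up page used only for illustration.
SAMPLE_HTML = """
<html><head>
<link rel="stylesheet" href="/css/site.css">
<script src="/js/app.js"></script>
</head><body>
<img src="images/logo.png">
<img src="http://images.example.com/banner.jpg">
</body></html>
"""

if __name__ == "__main__":
    print(request_counts(SAMPLE_HTML, "http://www.example.com/"))
```

The High level can’t be estimated this way, since the additional requests only show up once a real browser executes the JavaScript and applies the CSS.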

Low Browser Fidelity should only be considered when price is an issue, when you plan on buying a lot of services (100+), or when your web application is strictly functional and doesn’t have any content (images, etc.). In the majority of cases I would recommend at least the Mid Browser Fidelity solution, because it gives you a pretty good ballpark representation of your web application’s performance and is generally cheaper than High Browser Fidelity services. I would recommend the High Browser Fidelity service if your web application uses a lot of JavaScript or other dynamic or client-side technologies, or if you are providing or tracking Service Level Agreements with your clients and third parties.

Oh yeah, performance of this site:

  • Average load time is 1.66 seconds.
  • Availability (uptime) is 100%.

But the 7-day availability is now 99.94% because of a timeout error (the page took longer than 3 seconds to download) on Tuesday (7/15) at 11:15pm.

July 14, 2008

The Importance of Consistent and Frequent Performance Monitoring

Filed under: Monitoring Best Practices, Performance Monitoring - Tyler Fullerton @ 7:36 pm

To have a successful blog you must be consistent not only in your message but also in the frequency of your posts. Furthermore, your frequency has to be granular enough to garner interest. For example, if someone only blogged once a month, or maybe once a week but not consistently, would you continue to read the blog? What if the underlying message wasn’t consistent (one day they’re talking about movies, the next they’re talking about U.S. politics)? Would you still continue to subscribe? This idea about consistency and frequency in blog posts holds true for monitoring as well. In the world of monitoring you need consistent behavior and a fairly granular monitoring frequency.

In the realm of monitoring I use the term “monitoring interval”, which is the frequency with which monitoring is performed (that is, how often you check the performance and availability of a web application). Now, let’s take our two concepts, consistency and frequency, and see why they’re important for monitoring. Before I do that, I want to point out that there are 3 primary goals commonly achieved with monitoring: real-time alerting, trending on historical data, and reporting on SLAs (Service Level Agreements). Back to the concepts. Here is why they’re important:

  • Consistency: Using the same monitoring interval (that is, not varying the frequency of the monitoring) will assure you that you will have samples of data for any given time period and that you will not be surprised by any periods of time where no samples or few samples are taken. This is critical for real-time alerting, searching for trends in data, and SLA reporting.
  • Frequency: The frequency is what determines how quickly we can react to issues and it also allows us to have a fine level of granularity to our data.

Frequency is therefore critical to real-time alerting as well as SLA reporting (more on this in a bit). Now, let’s look at the different levels of consistency and frequency, with examples. I was going to draw a diagram of this but didn’t have time, so I will try to post that at a later date. In the meantime, here is a text-based breakdown, starting with no consistency (or frequency) and progressing to high consistency (or frequency):

  1. No consistency/frequency – Just don’t monitor in any way whatsoever. The only consistency is that there is no monitoring being performed. You have to rely on users of your site to inform you of issues (resulting in brand damage and poor user experience), your ability to address issues is reduced almost to zero, and there is no potential to report on SLAs or performance.
  2. Low consistency/frequency – When you have some free time (or maybe once a day) you run through the transaction by hand in a browser and, if you’re feeling intrepid, you log the data in an Excel spreadsheet. The impact is that you can’t do real-time alerting, and your ability to trend on data will be minimal because you don’t have enough data to support any meaningful analysis. Rudimentary SLA reporting can be done but would be of little or no value.
  3. Moderate consistency/frequency – Some sort of monitoring tool is used to check the web application in an automated way, but the frequency is set to 1 hour. This is a good start because you are now monitoring automatically; however, your granularity is too coarse to allow real-time alerting. You can start to get an idea of trends in the data (though it will probably frustrate you), and the SLA reporting that you could do will be just enough to cause your users to laugh and not take you seriously.
  4. High consistency/frequency – Monitoring is performed by a tool that automates the process (similar to the previous point), but the monitoring interval (frequency) has a finer granularity. For example, the interval is usually every 5 minutes (but could be shorter). You have consistent and reliable monitoring, and your frequency is just right: you can be alerted to issues with very little latency (within ~5 minutes), your data set is complete enough to allow you to make decisions based on previous trends in performance, and you can provide SLAs that will have meaning to your end users.
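
Some quick arithmetic shows why the interval matters so much. The Python sketch below is only a back-of-the-envelope illustration with hypothetical function names, and it assumes that availability is computed as the share of successful samples and that a problem is noticed at the next sample. It compares the daily, hourly, and 5-minute intervals from the list above.

```python
MINUTES_PER_WEEK = 7 * 24 * 60


def samples_per_week(interval_minutes):
    """How many samples a fixed monitoring interval collects in 7 days."""
    return MINUTES_PER_WEEK // interval_minutes


def availability_resolution(interval_minutes):
    """Percentage points of 7-day availability represented by one sample."""
    return 100.0 / samples_per_week(interval_minutes)


if __name__ == "__main__":
    intervals = (("daily, by hand", 24 * 60), ("hourly", 60), ("every 5 minutes", 5))
    for label, interval in intervals:
        print(
            f"{label}: {samples_per_week(interval)} samples/week, "
            f"one sample = {availability_resolution(interval):.3f} points of availability, "
            f"worst-case detection delay = {interval} minutes"
        )
```

With one check a day, a single bad sample moves the weekly availability figure by more than 14 points, which is why SLA reporting at that frequency has little value. At 5 minutes, one sample is about 0.05 points.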

When implementing a monitoring solution for a web application, it is imperative that you consider the consistency of the monitoring solution as well as the frequency at which you will monitor. Without these considerations, the data you’re collecting may not be of any relevance or use to you. Oh, yeah, one other thing. I’m still monitoring this site. I’d like to see how the performance changes over time, so I’m going to continue to post the performance at the end of each blog entry.

  • Average load time is 1.69 seconds.
  • Availability (uptime) is 100%.

July 11, 2008

Performance Monitoring

Filed under: Performance Monitoring - Tyler Fullerton @ 8:43 am

Hi, and welcome to the Performance Monitoring blog! This blog is geared towards performance issues related to web-based applications. Monitoring appears in the title because I work at a San Diego-based company (Webmetrics) that provides services and products for external monitoring of web applications, so a lot of my experience and knowledge comes from that background. My goal is to educate developers, IT folks, and business folks on the merits of performance monitoring and why it’s critical to today’s web applications. Discussions will generally have a slant towards the impact that performance issues have on business or the impact that web technologies have on performance.

Just a bit about myself: I started at Webmetrics as an intern while attending UCSD in 2005. After graduating I became a full-time Software Engineer with Webmetrics but quickly became isolated from the goals I was trying to achieve with the software I was developing. I had no insight into how customers were actually using the product (part of this had to do with the fact that there was no Sales Engineer at Webmetrics at that time). I was approached to fill the void of the Sales Engineer position and didn’t think twice about it (though it was extremely hard at first). I finally felt that I was on the front lines and able to learn more about performance monitoring than I ever could have as a developer.

Seeing that this discussion is about performance monitoring, I figured a good first step would be to set up external monitoring for this blog. I set up a Webmetrics (www.webmetrics.com) monitoring service which will access my blog every 5 minutes from an external location (US locations only). The monitoring service is composed of a worldwide network of monitoring agents (servers) that open an IE7 browser and navigate to perfmonblog.wordpress.com, verifying both the content on the page and the performance (load time and availability) of the page. So far it’s looking pretty good (even though I don’t have a lot of content yet). The metrics for today are:

  • Average load time is 1.70 seconds.
  • Availability (uptime) is 100%.
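
For the curious, a single monitoring sample boils down to something like the Python sketch below. This is a simplified illustration, not the Webmetrics service: the check_page function, its result fields, and the 3-second timeout default are my own assumptions, and it makes one plain request rather than driving a real IE7 browser. It loads the page, verifies that some expected text is present, and records the load time.

```python
import time
from urllib.request import urlopen


def fetch_with_urllib(url, timeout):
    """Download a page with one plain HTTP request (no images, no JavaScript)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def check_page(url, expected_text, fetch=fetch_with_urllib, timeout=3.0):
    """Take one sample: is the page available, and how long did it take to load?"""
    start = time.perf_counter()
    try:
        body = fetch(url, timeout)
    except Exception as exc:
        return {"available": False, "load_time": None, "error": str(exc)}
    load_time = time.perf_counter() - start
    if load_time > timeout:
        return {"available": False, "load_time": load_time, "error": "timeout"}
    if expected_text not in body:
        return {"available": False, "load_time": load_time,
                "error": "content check failed"}
    return {"available": True, "load_time": load_time, "error": None}


if __name__ == "__main__":
    # A canned response stands in for the network so the example runs anywhere.
    def fake_fetch(url, timeout):
        return "<html><body>The PerfMon Blog</body></html>"

    print(check_page("http://perfmonblog.wordpress.com/", "PerfMon", fetch=fake_fetch))
```

Run on a fixed interval from an external location, samples like this are what the average load time and availability numbers are built from.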
